Tree extraction from Wikipedia categories using Java

Basically I plan to use the Wikipedia API sandbox to extract the entire category tree under the root node "Economics". I don't need the content of the articles, only a few basic details such as the pageid, the title, and (at a later stage of my work) the revision history. So far I can extract it level by level, but what I want is a recursive/iterative function.

Each category contains subcategories and articles (like every root containing nodes and leaves). I wrote code that extracts the first level into files: one file contains the articles, and a second contains the names of the categories (daughters of the root that can be expanded further). Then I go down a level and use similar code to extract their categories, articles and subcategories. The code stays similar at each level, but the problem is scalability. I need to reach the lowest leaves of all nodes, so I need a recursion that keeps checking until the end.

I prefix the files containing categories with 'c_', so I can branch on that condition when extracting the different levels. Now, for some reason, the program has gone into a deadlock and keeps appending the same entries. I need a way out of it.
package wikiCrawl;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
public class SubCrawl
{
public static void main(String[] args) throws IOException, InterruptedException, JSONException
{
File file = new File("C:/Users/User/Desktop/Root/Economics_2.txt");
crawlfile(file);
}
public static void crawlfile(File food) throws JSONException, IOException ,InterruptedException
{
ArrayList<String> cat_list =new ArrayList <String>();
Scanner scanner_cat = new Scanner(food);
scanner_cat.useDelimiter("\n");
while (scanner_cat.hasNext())
{
String scan_n = scanner_cat.next();
if(scan_n.indexOf(":")>-1)
cat_list.add(scan_n.substring(scan_n.indexOf(":")+1));
}
scanner_cat.close(); // release the file handle once the category list is read
System.out.println(cat_list);
//get the categories in different languages
URL category_json;
for (int i_cat=0; i_cat<cat_list.size();i_cat++)
{
category_json = new URL("https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A"+cat_list.get(i_cat).trim().replaceAll(" ", "%20")+"&cmlimit=500"); //.trim() removes leading and trailing whitespace before the spaces are percent-encoded
System.out.println(category_json);
HttpURLConnection urlConnection = (HttpURLConnection) category_json.openConnection(); //Opens the connection to the URL so clients can communicate with the resource.
BufferedReader reader = new BufferedReader (new InputStreamReader(urlConnection.getInputStream())); //read from the connection just opened instead of opening a second one with openStream()
String line;
String diff = "";
while ((line = reader.readLine()) != null)
{
System.out.println(line);
diff=diff+line;
}
urlConnection.disconnect();
reader.close();
JSONArray jsonarray_cat = new JSONObject(diff).getJSONObject("query").getJSONArray("categorymembers"); //parse the response as JSON instead of scanning for a substring
System.out.println(jsonarray_cat);
//Loop categories
for (int i_url = 0; i_url<jsonarray_cat.length();i_url++) //the JSONArray holds one JSON object per category member; loop through each object
{
//Extract the pageid and title of each member
int pageid=jsonarray_cat.getJSONObject(i_url).getInt("pageid"); //pageid is a JSON number, so use getInt rather than parsing a string
System.out.println(pageid);
String title=jsonarray_cat.getJSONObject(i_url).getString("title");
System.out.println(title);
File food_year= new File("C:/Users/User/Desktop/Root/"+cat_list.get(i_cat).replaceAll(" ", "_").trim()+".txt");
File food_year2= new File("C:/Users/User/Desktop/Root/c_"+cat_list.get(i_cat).replaceAll(" ", "_").trim()+".txt");
food_year.createNewFile();
food_year2.createNewFile();
BufferedWriter writer = new BufferedWriter (new OutputStreamWriter(new FileOutputStream(food_year, true)));
BufferedWriter writer2 = new BufferedWriter (new OutputStreamWriter(new FileOutputStream(food_year2, true)));
if (title.contains("Category:"))
{
writer2.write(pageid+";"+title);
writer2.newLine();
writer2.flush();
crawlfile(food_year2); //recurses on the category file, which is still being appended to while it is re-read
}
else
{
writer.write(pageid+";"+title);
writer.newLine();
writer.flush();
}
writer.close(); //close both writers so the appends are flushed and the file handles released
writer2.close();
}
}
}
}
For starters, this may be too much of a demand on the Wikimedia servers. There are over a million categories (https://stats.wikimedia.org/EN/TablesWikipediaEN.htm#namespaces), and you need to read https://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F –
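Beyond the server load, the non-termination in the question comes from recursing on `food_year2`, the very file the loop is still appending to, with nothing remembering which categories were already expanded; Wikipedia's category graph also contains cycles, so even a clean walk revisits nodes. Below is a minimal sketch of the fix using an in-memory visited set. The `Map`-backed category graph stands in for the `list=categorymembers` responses, and the class and method names (`CategoryCrawler`, `crawl`) are hypothetical, not part of the original code:

```java
import java.util.*;

// Sketch: recursive category walk guarded by a visited set, so a cycle in
// the category graph cannot cause the same category to be expanded forever.
public class CategoryCrawler {
    private final Map<String, List<String>> members; // category -> its members
    private final Set<String> visited = new HashSet<>();
    private final List<String> articles = new ArrayList<>();

    public CategoryCrawler(Map<String, List<String>> members) {
        this.members = members;
    }

    public void crawl(String category) {
        if (!visited.add(category)) return; // already expanded: stop here
        for (String m : members.getOrDefault(category, Collections.emptyList())) {
            if (m.startsWith("Category:")) {
                crawl(m);          // subcategory: recurse on the title itself
            } else {
                articles.add(m);   // leaf: an article
            }
        }
    }

    public List<String> getArticles() { return articles; }

    public static void main(String[] args) {
        // A tiny graph with a cycle: Economics -> Macroeconomics -> Economics
        Map<String, List<String>> g = new HashMap<>();
        g.put("Category:Economics",
              Arrays.asList("Category:Macroeconomics", "Inflation"));
        g.put("Category:Macroeconomics",
              Arrays.asList("Category:Economics", "Gross domestic product"));
        CategoryCrawler c = new CategoryCrawler(g);
        c.crawl("Category:Economics");
        System.out.println(c.getArticles()); // prints [Gross domestic product, Inflation]
    }
}
```

The same guard would apply in the question's code: before calling `crawlfile` on a `Category:` title, check a set of already-crawled category names, and recurse on the title (or a per-category file) rather than re-reading a file that is still being written.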