First paragraph of a Wikipedia article

I am writing some Java code to run NLP tasks on text from Wikipedia. How can I use JSoup to extract the first paragraph of a Wikipedia article?

Thanks so much.

2011-11-27 Lida

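This is very simple, and the process is much the same for every semi-structured page you extract information from.

First, you have to uniquely identify the DOM element that holds the information you need. The easiest way to do this is with a web development tool, such as Firebug in Firefox, or the ones that come bundled with IE (> 6, I think) and Chrome.

Using the article Potato as an example, you will find that the <p>aragraph you are interested in is in the following block: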

<div class="mw-content-ltr" lang="en" dir="ltr"> 
    <div class="metadata topicon" id="protected-icon" style="display: none; right: 55px;">[...]</div> 
    <div class="dablink">[...]</div> 
    <div class="dablink">[...]</div> 
    <div>[...]</div> 
    <p>The potato [...]</p> 
    <p>[...]</p> 
    <p>[...]</p>

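In other words, you want the first <p> element inside the div whose class is mw-content-ltr.

Then you just need to select that element with jsoup, for example using its selector syntax (which is very similar to jQuery's):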

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WikipediaParser {
    private final String baseUrl;

    public WikipediaParser(String lang) {
        this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
    }

    public String fetchFirstParagraph(String article) throws IOException {
        String url = baseUrl + article;
        Document doc = Jsoup.connect(url).get();
        // Select every <p> inside the element with class "mw-content-ltr".
        Elements paragraphs = doc.select(".mw-content-ltr p");

        // first() returns null when nothing matched, so guard against that.
        Element firstParagraph = paragraphs.first();
        return firstParagraph == null ? "" : firstParagraph.text();
    }

    public static void main(String[] args) throws IOException {
        WikipediaParser parser = new WikipediaParser("en");
        String firstParagraph = parser.fetchFirstParagraph("Potato");
        System.out.println(firstParagraph); // prints "The potato is a starchy [...]."
    }
}
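One caveat with building the URL by plain concatenation: it works for a title like Potato, but a title containing spaces or non-ASCII characters needs some care. Below is a minimal sketch of one way to handle that, not part of the original answer. The class name WikiUrlSketch and the method articleUrl are hypothetical, the https scheme is an assumption, and URLEncoder is really meant for form encoding, so treat this as an approximation rather than a complete path encoder. It needs Java 10 or later for the Charset overload of URLEncoder.encode.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WikiUrlSketch {

    // Hypothetical helper: builds an article URL from a language code
    // and a human-readable title.
    public static String articleUrl(String lang, String title) {
        // Wikipedia article URLs use underscores where the title has spaces.
        String normalized = title.trim().replace(' ', '_');
        // Percent-encode the remaining unsafe characters (Java 10+ overload).
        String encoded = URLEncoder.encode(normalized, StandardCharsets.UTF_8);
        // Assumption: the site is served over https.
        return String.format("https://%s.wikipedia.org/wiki/%s", lang, encoded);
    }

    public static void main(String[] args) {
        System.out.println(articleUrl("en", "United States"));
        System.out.println(articleUrl("en", "B-tree"));
    }
}
```

The result could then be passed to Jsoup.connect(...) in place of baseUrl + article.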


2011-11-27 16:41:50

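Hello, thank you very much indeed. The suggested solution works perfectly. – Lida

It seems that the first paragraph is also the first <p> block in the document. So this might work: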

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/B-tree").get(); 
Elements paragraphs = doc.select("p"); 
Element firstParagraph = paragraphs.first();

Now you can get the content of this element.


2011-11-27 16:42:49 hage

`getElementsByClass()` returns elements by class name, not by tag name. – BalusC

@BalusC Oh yes, you're right. I updated my answer. – hage

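The solution proposed by Silva works in most cases, but not for articles like "JavaScript" and "United States". The paragraphs should be selected as doc.select(".mw-body-content p");

Check this GitHub code for more details. You can also remove some metadata information from the HTML to improve accuracy.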
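To illustrate the idea of removing metadata before reading the paragraph, here is a rough sketch that runs on an inline HTML string instead of a live page. It is not the GitHub code mentioned above. The class and method names are hypothetical, the sample markup is made up, and it assumes that footnote markers appear as <sup class="reference"> elements and that the page may start with an empty <p>. It needs jsoup on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstParagraphSketch {

    // Returns the text of the first non-empty paragraph inside
    // .mw-body-content, or an empty string if there is none.
    public static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);

        // Assumption: footnote markers such as "[1]" live in <sup class="reference">.
        doc.select(".mw-body-content sup.reference").remove();

        for (Element p : doc.select(".mw-body-content p")) {
            String text = p.text().trim();
            if (!text.isEmpty()) {
                return text; // skip empty placeholder paragraphs
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Hypothetical markup, loosely modelled on a Wikipedia page.
        String html = "<div class=\"mw-body-content\">"
                + "<p class=\"mw-empty-elt\"> </p>"
                + "<p>The potato is a starchy tuber.<sup class=\"reference\">[1]</sup></p>"
                + "<p>Second paragraph.</p>"
                + "</div>";
        System.out.println(firstParagraph(html));
    }
}
```

For a live page, the same method could be fed the result of Jsoup.connect(url).get().html(), or adapted to take the Document directly.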


2016-07-13 22:13:49
