2014-09-27 95 views

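Jsoup parsing HTML duplicates text when writing to file

I seem to be hitting a bug where the text is written to the file twice: the first time with incorrect formatting, the second time with correct formatting. The method below takes in this URL after it's been converted properly. It should print a newline between the text of each child of the divs nested under the div with class "ffaq", which is where all the body text lives. Any help would be appreciated. I'm fairly new to jsoup, so an explanation would be nice too.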

/**
 * Method to deal with HTML 5 Gamefaq entries.
 * @param url The location of the HTML 5 entry to read.
 */
public static void htmlDocReader(URL url) {
    try {
        Document doc = Jsoup.parse(url.openStream(), "UTF-8", url.toString());
        // parse pagination label
        String[] num = doc.select("div.span12")
                          .select("ul.paginate")
                          .select("li")
                          .first()
                          .text()
                          .split("\\s+");
        // get the max page number
        final int max_pagenum = Integer.parseInt(num[num.length - 1]);

        // create a new file based on the url path
        File file = urlFile(url);
        PrintWriter outFile = new PrintWriter(file, "UTF-8");

        // add every page to the text file
        for (int i = 0; i < max_pagenum; i++) {
            // if not the first page then change the url
            if (i != 0) {
                String new_url = url.toString() + "?page=" + i;
                doc = Jsoup.parse(new URL(new_url).openStream(), "UTF-8",
                        new_url.toString());
            }
            Elements walkthroughs = doc.select("div.ffaq");
            for (Element elem : walkthroughs.select("div")) {
                for (Element inner : elem.children()) {
                    outFile.println(inner.text());
                }
            }
        }
        outFile.close();
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }
}

Answer


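For every element on which you call text(), you print all the text in its structure: the element's own text plus the text of all its descendants. Take the following example: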

<div> 
text of div 
<span>text of span</span> 
</div> 

If you call text() on the div element, you will get

text of div text of span

Then, if you call text() on the span, you will get

text of span

To avoid the duplicates, use ownText() instead. It returns only the element's direct text, not the text of its children.
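To see why the nested loop writes the same text twice, here is a minimal sketch that needs no network access and no jsoup on the classpath. It is a toy model, not jsoup itself: the `Node` class, `dump`, `collect`, and `sample` are hypothetical names made up for illustration, and `Node.text()` only approximates jsoup's behavior (it always puts the element's own text first). The loop has the same shape as the one in the question, except that empty lines are skipped in both modes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TextVsOwnText {

    /** Toy stand-in for an HTML element. This is NOT jsoup's Element class. */
    static final class Node {
        final String tag;
        final String own;          // text written directly inside this element
        final List<Node> children; // nested elements

        Node(String tag, String own, Node... children) {
            this.tag = tag;
            this.own = own;
            this.children = Arrays.asList(children);
        }

        /** Mirrors the idea of ownText(): direct text only. */
        String ownText() {
            return own;
        }

        /** Mirrors the idea of text(): direct text plus the text of all descendants. */
        String text() {
            StringBuilder sb = new StringBuilder(own);
            for (Node child : children) {
                String t = child.text();
                if (t.isEmpty()) continue;
                if (sb.length() > 0) sb.append(' ');
                sb.append(t);
            }
            return sb.toString();
        }
    }

    /** Adds root and all of its descendants to out, in document order. */
    static void collect(Node root, List<Node> out) {
        out.add(root);
        for (Node child : root.children) {
            collect(child, out);
        }
    }

    /** Visits every div, then each of its children, as the question's loop does. */
    static List<String> dump(Node root, boolean useOwnText) {
        List<Node> all = new ArrayList<>();
        collect(root, all);
        List<String> lines = new ArrayList<>();
        for (Node elem : all) {
            if (!elem.tag.equals("div")) continue; // stands in for select("div")
            for (Node inner : elem.children) {
                String line = useOwnText ? inner.ownText() : inner.text();
                if (!line.isEmpty()) lines.add(line);
            }
        }
        return lines;
    }

    /** Models: <div class="ffaq"><div>text of div <span>text of span</span></div></div> */
    static Node sample() {
        Node span = new Node("span", "text of span");
        Node inner = new Node("div", "text of div", span);
        return new Node("div", "", inner);
    }

    public static void main(String[] args) {
        System.out.println(dump(sample(), false)); // [text of div text of span, text of span]
        System.out.println(dump(sample(), true));  // [text of div, text of span]
    }
}
```

With text(), "text of span" shows up twice: once merged into its parent's line and once on its own line, which matches the "badly formatted, then correctly formatted" symptom in the question. With ownText(), each piece of text is emitted exactly once.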

Long story short, change this

for (Element elem : walkthroughs.select("div")) {
    for (Element inner : elem.children()) {
        outFile.println(inner.text());
    }
}

to this

for (Element elem : walkthroughs.select("div")) {
    for (Element inner : elem.children()) {
        String line = inner.ownText().trim();
        if (!line.equals("")) { // skip empty lines
            outFile.println(line);
        }
    }
}