The easiest way to extract information from (parse) HTML in Java

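I have read a lot of questions on Stack Overflow about HTML parsing. I learned that we should avoid regular expressions where possible and use a parser instead. I know there are many HTML/XML parsers, but I don't know how to use them properly.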

Consider this HTML, cleaned up by jTidy. I have the Document object that jTidy created for this code:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> 
<head> 
    <!-- Header content --> 
</head> 
<body> 
    <div id="container"> 
     <div id="id1"> ... </div> 
     <div id="id2"> ... </div> 
     <div id="mainContent"> 
      <div id="section 1"> 
       <div id="subSection"> 
        <!-- Interested part --> 
        <tbody> 
         <tr class="success"> 
          <td class="fileName"><span>File One</span></td> 
         </tr> 
         <tr class="fail"> 
          <td class="fileName"><span>File Two</span></td> 
         </tr>       
         <tr class="success"> 
          <td class="fileName"><span>File Three</span></td> 
         </tr> 
        </tbody> 
       </div> 
      </div> 
     </div> 
    </div> 
</body>

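Now I want to map (in a Map :D) every file name to its corresponding class (success/fail). I can do it with DOM, but I would have to create a NodeList and then a new NodeList for each element (a lot of memory, and boring). There are other options, such as SAX, Xerces, etc., but I don't know their pros and cons.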
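For comparison, the nested-NodeList DOM traversal described above could look roughly like the sketch below. It is only an illustration under some assumptions: the class `DomFileStatus`, its `parse` helper and the `SAMPLE` fragment (which wraps the rows in a `<table>`) are made up for the example, and the JDK's own `DocumentBuilder` stands in for the Document that jTidy would return.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomFileStatus {

    // Hypothetical sample: the interesting rows from the question, wrapped in a <table>.
    static final String SAMPLE = "<div id=\"subSection\"><table><tbody>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File One</span></td></tr>"
            + "<tr class=\"fail\"><td class=\"fileName\"><span>File Two</span></td></tr>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File Three</span></td></tr>"
            + "</tbody></table></div>";

    // Assumption: stands in for the Document object that jTidy would hand back.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    // Maps each file name to the class attribute of its enclosing <tr>.
    static Map<String, String> extract(Document doc) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList rows = doc.getElementsByTagName("tr");
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            // The second NodeList the question complains about: the cells of this row.
            NodeList cells = row.getElementsByTagName("td");
            for (int j = 0; j < cells.getLength(); j++) {
                Element cell = (Element) cells.item(j);
                if ("fileName".equals(cell.getAttribute("class"))) {
                    result.put(cell.getTextContent().trim(), row.getAttribute("class"));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(parse(SAMPLE)));
    }
}
```

A `LinkedHashMap` is used only so that the entries keep the order of the rows in the document.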

What is the simplest (fastest) way to extract this information from the "jTidied" HTML above?

2012-02-26 Angelo

Use XPath: http://stackoverflow.com/questions/7049150/how-to-extract-data-using-jtidy-and-xpath – Greg 2012-02-26 18:13:19

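I have read about XPath, but the problem is that I would have to: 1) create a pattern for the file name, 2) create a pattern for the class, 3) associate class and file name. That is not very simple. – Angelo 2012-02-26 18:17:59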
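The three steps in this comment can be folded into a single loop: select the rows once, then evaluate two short expressions relative to each row. Below is a minimal sketch using the JDK's `javax.xml.xpath`; the class name `XPathFileStatus` and the `SAMPLE` fragment are hypothetical, and the fragment deliberately omits the `xmlns` declaration. Against namespaced XHTML parsed in namespace-aware mode, the expressions would need a `NamespaceContext` or `local-name()` tests instead.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathFileStatus {

    // Hypothetical sample: the rows from the question, without a default namespace.
    static final String SAMPLE = "<div id=\"subSection\"><table><tbody>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File One</span></td></tr>"
            + "<tr class=\"fail\"><td class=\"fileName\"><span>File Two</span></td></tr>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File Three</span></td></tr>"
            + "</tbody></table></div>";

    static Map<String, String> extract(String xhtml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // One pass to collect every row in the document.
        NodeList rows = (NodeList) xpath.evaluate(
                "//tr", new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            // Both expressions are relative to the current row, which pairs them up.
            String status = xpath.evaluate("@class", row);
            String name = xpath.evaluate("normalize-space(td[@class='fileName'])", row);
            result.put(name, status);
        }
        return result;
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(extract(SAMPLE));
    }
}
```

Evaluating relative to each row is what takes care of step 3: the class and the file name come from the same `tr`, so no separate matching step is needed.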

How about HtmlUnit: http://htmlunit.sourceforge.net/ – 2012-02-26 18:46:17

First of all, you forgot to add the <table> tag.

You can easily parse your HTML with Jsoup.

The code below is an example:

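import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseFiles {
    public static void main(String[] args) throws IOException {
        // String html = " ...here goes your html code... ";
        // Document doc = Jsoup.parse(html);
        // Or from a file:
        File input = new File("com.htm");
        Document doc = Jsoup.parse(input, "UTF-8");
        Elements trs = doc.select("tr"); // select all "tr" elements in the document
        for (Element tr : trs) {
            // the class attribute of the tr element (success/fail)
            System.out.println("The file class is: " + tr.attr("class")
                    // the file name text held inside the td element
                    + " The filename is: " + tr.select("td").text());
        }
    }
}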

2012-02-27 10:36:53 vacuum

Thanks. Because of all the indentation, the <table> tag was ignored. Thanks again! – Angelo 2012-02-27 11:19:43

In my opinion, the best approach is to use XSLT + XPath (as Greg suggested in the comments) to produce the input for an unmarshaller.

So the whole flow looks like this: HTML -> [jTidy purifying] -> XHTML -> [XSLT transformation] -> string data representation -> [JAXB unmarshaller] -> Java objects.
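The XSLT step of that flow might be sketched as follows, using the JDK's `javax.xml.transform`. This is an illustration under assumptions, not a reference implementation: the stylesheet, the `<files>`/`<file>` output format, the class name `XsltFileStatus` and the `SAMPLE` input are all invented for the example, the input carries no default namespace, and the jTidy and JAXB steps are left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltFileStatus {

    // Hypothetical sample: the rows from the question, without a default namespace.
    static final String SAMPLE = "<div id=\"subSection\"><table><tbody>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File One</span></td></tr>"
            + "<tr class=\"fail\"><td class=\"fileName\"><span>File Two</span></td></tr>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File Three</span></td></tr>"
            + "</tbody></table></div>";

    // Hypothetical stylesheet: one <file name="..." status="..."/> element per row.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
            + "<xsl:template match=\"/\">"
            + "<files>"
            + "<xsl:for-each select=\"//tr\">"
            + "<file name=\"{normalize-space(td[@class='fileName'])}\" status=\"{@class}\"/>"
            + "</xsl:for-each>"
            + "</files>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the stylesheet and returns the result as a string,
    // i.e. the "string data representation" stage of the flow.
    static String transform(String xhtml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform(SAMPLE));
    }
}
```

The resulting XML string is the kind of input a JAXB unmarshaller could then bind to annotated classes; that binding is not shown here.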

If you don't need objects to be produced, use XPath alone, as described in this thread: How to read XML using XPath in Java

2012-02-26 18:48:35

Try JSoup.

2012-02-26 19:07:19 Dipin
