XML text extraction

Consider the following XML file:

<a:root 
xmlns:h="http://www.w3.org/TR/html4/" 
xmlns:f="http://www.w3schools.com/furniture"> 

<h:table> 
    <h:tr> 
    <h:td>Apples</h:td> 
    <h:td>Bananas</h:td> 
    </h:tr> 
</h:table> 

<f:table> 
    <f:name>African Coffee Table</f:name> 
    <f:width>80</f:width> 
    <f:length>120</f:length> 
</f:table> 

aaaaaaaaaaaaaa 

</a:root>

How can I extract only the text that belongs directly to the main element <a:root>:

"\naaaaaaaaaaaaaa\n"

My code so far is:

import java.io.File; 
import java.util.Stack; 

import javax.xml.parsers.DocumentBuilder; 
import javax.xml.parsers.DocumentBuilderFactory; 

import org.w3c.dom.Document; 
import org.w3c.dom.NodeList; 


public class Proof { 
    public static void main(String[] args) { 
     Document doc = null; 
     DocumentBuilderFactory dbf = null; 
     DocumentBuilder docBuild = null; 
     try { 

      dbf = DocumentBuilderFactory.newInstance(); 
      docBuild = dbf.newDocumentBuilder(); 
      doc = docBuild.parse(new File("test2.xml")); 

      System.out.println(doc.getFirstChild().getTextContent()); 
     } catch(Exception e) { 
      e.printStackTrace(); 
     } 
    } 
}

But it returns the text I want ("aaaaaaaaaaaaaa") plus the inner text of all the other elements. Output:

Apples 
    Bananas 




    African Coffee Table 
    80 
    120 


aaaaaaaaaaaaaa

The requirement is not to use any additional Java XML libraries!

Source: 2011-09-03, Andrei Ciobanu

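Good question, +1. See my answer for a correct, short, and simple XPath one-liner that selects exactly the wanted text node. :)

@Dimitre Novatchev, I think you need to lower your self-importance. I can't provide Java code right now, but here is C# code, and as far as I know you are not only an XML expert but also a .NET expert ;-), so you can check the result: var result = doc.SelectNodes(@"a:root/text()", xmlnsManager).OfType();. The result should be '\r\n\r\n\r\n' ... :-)

@Kirill Polishchuk: Run your code with Saxon or AltovaXML and count the text nodes. Your code produces the expected result by pure luck, and only with certain (Microsoft) XSLT processors, because their default setting is to strip whitespace-only text nodes. We are not talking about "self-importance" here, but about (a lack of) basic knowledge.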

The answer by @Kirill Polishchuk is not correct.

The proposed expression:

a:root/text()

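is a relative expression, and unless it is evaluated with the root (/) node as the context node, it selects nothing in the provided XML document.
Even the absolute XPath expression /a:root/text() is not correct, because it selects three text nodes (all text-node children of the top element), two of which are whitespace-only text nodes.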
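
Both points can be checked with the JDK's built-in javax.xml.xpath API. The following is a minimal sketch, not the answerer's code: the class name TextNodeCount, the namespace URI urn:example:a (the question's XML never declares the a prefix) and the shortened sample document are assumptions made for illustration. The parser has to be namespace-aware, otherwise prefixed XPath steps do not match.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TextNodeCount {
    // Assumption: a made-up namespace URI for the "a" prefix.
    static final String A_NS = "urn:example:a";

    // Shortened stand-in for the question's document: the root has
    // three text-node children, two of them whitespace-only.
    static final String XML =
        "<a:root xmlns:a=\"" + A_NS + "\">\n"
      + "<h:table xmlns:h=\"http://www.w3.org/TR/html4/\"><h:td>Apples</h:td></h:table>\n"
      + "<f:table xmlns:f=\"http://www.w3schools.com/furniture\"><f:width>80</f:width></f:table>\n"
      + "aaaaaaaaaaaaaa\n"
      + "</a:root>";

    // Maps exactly one prefix to one namespace URI for XPath evaluation.
    static NamespaceContext singlePrefix(final String prefix, final String uri) {
        return new NamespaceContext() {
            @Override public String getNamespaceURI(String p) {
                return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String u) {
                return uri.equals(u) ? prefix : null;
            }
            @Override public Iterator<String> getPrefixes(String u) {
                String p = getPrefix(u);
                return (p == null ? Collections.<String>emptyList()
                                  : Collections.singletonList(p)).iterator();
            }
        };
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for a:root to match
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates a numeric XPath expression against the given context node.
    static double count(String expr, Object context) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(singlePrefix("a", A_NS));
        return (Double) xpath.evaluate(expr, context, XPathConstants.NUMBER);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        // Absolute path: all three text children are selected.
        System.out.println(count("count(/a:root/text())", doc));                     // 3.0
        // Relative path with the root element as context: nothing is selected.
        System.out.println(count("count(a:root/text())", doc.getDocumentElement())); // 0.0
    }
}
```

The second call returns zero because a:root has no a:root child of its own; the relative expression only works when the context node is the document node.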

Here is a correct XPath solution:

/a:root/text()[string-length(normalize-space()) > 0]

When this XPath expression is applied to the provided XML document (corrected to be well-formed):

<a:root 
xmlns:a="UNDEFINED !!!!" 
xmlns:h="http://www.w3.org/TR/html4/" 
xmlns:f="http://www.w3schools.com/furniture"> 

<h:table> 
    <h:tr> 
    <h:td>Apples</h:td> 
    <h:td>Bananas</h:td> 
    </h:tr> 
</h:table> 

<f:table> 
    <f:name>African Coffee Table</f:name> 
    <f:width>80</f:width> 
    <f:length>120</f:length> 
</f:table> 

aaaaaaaaaaaaaa 

</a:root>

it selects, as required, the last (and only non-whitespace) text-node child of the top element:

aaaaaaaaaaaaaa
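
In Java, the same expression can be evaluated without extra libraries, since javax.xml.xpath ships with Java SE 5 and later. What follows is a sketch under stated assumptions rather than the answerer's own code: the class name RootTextXPath, the method rootText, the namespace URI urn:example:a and the shortened sample document are all illustrative.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RootTextXPath {
    // Assumption: a made-up namespace URI bound to the "a" prefix.
    static final String A_NS = "urn:example:a";

    // Shortened stand-in for the document shown above.
    static final String XML =
        "<a:root xmlns:a=\"" + A_NS + "\" xmlns:f=\"http://www.w3schools.com/furniture\">\n"
      + "<f:table><f:name>African Coffee Table</f:name></f:table>\n"
      + "aaaaaaaaaaaaaa\n"
      + "</a:root>";

    // Returns the value of the first non-whitespace text child of a:root,
    // or null if there is none.
    static String rootText(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // prefixed XPath steps need namespace-aware DOM
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "a".equals(prefix) ? A_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return A_NS.equals(uri) ? "a" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                String p = getPrefix(uri);
                return (p == null ? Collections.<String>emptyList()
                                  : Collections.singletonList(p)).iterator();
            }
        });

        Node text = (Node) xpath.evaluate(
            "/a:root/text()[string-length(normalize-space()) > 0]",
            doc, XPathConstants.NODE);
        return text == null ? null : text.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        // The surrounding newlines are part of the text node, as in the question.
        System.out.println("\"" + rootText(XML) + "\"");
    }
}
```

The returned value keeps the leading and trailing newlines that the question asks for; call trim() on it if only the visible characters are wanted.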

XSLT-based verification:

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:a="UNDEFINED !!!!" 
> 
<xsl:output omit-xml-declaration="yes" indent="yes"/> 

<xsl:template match="/"> 
    <xsl:text>"</xsl:text> 
    <xsl:copy-of select= 
    "/a:root/text() 
      [string-length(normalize-space()) > 0]"/>" 

</xsl:template> 
</xsl:stylesheet>

When this transformation is applied to the provided XML document (above), the wanted, correctly selected text node is output:

" 

aaaaaaaaaaaaaa 

"

Source: 2011-09-04 03:07:29

Overkill. 'a:root/text()' will select exactly 1 text node (the 2 whitespace-only text nodes will be stripped).

You can use XPath: a:root/text()

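Source: 2011-09-03 11:56:11

+1. Use the javax.xml.xpath API, available in Java SE 5 and later.

This XPath expression has at least two problems that prevent it from selecting exactly the text node the OP wants; see my answer for more details.

Use this: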

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;


public class Proof {
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuild = dbf.newDocumentBuilder();
            Document doc = docBuild.parse(new File("test2.xml"));

            // Look only at the direct children of the root element,
            // not at the text nested inside its child elements.
            Element root = doc.getDocumentElement();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Node.TEXT_NODE == 3; whitespace-only text nodes are printed too.
                if (child.getNodeType() == Node.TEXT_NODE) {
                    System.out.println(child.getNodeValue());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
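
The loop above prints every direct text child of the root, including the whitespace-only ones between the child elements. A possible variant that skips those is sketched below; the class name DirectText, the method directText and the tiny sample document are hypothetical, and trim() is used here as a simple stand-in for XPath's normalize-space() test.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DirectText {
    // Concatenates the non-blank text nodes that are direct children of the root.
    static String directText(String xml) throws Exception {
        NodeList children = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement()
            .getChildNodes();

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Keep text nodes only, and drop those that are pure whitespace.
            if (child.getNodeType() == Node.TEXT_NODE
                    && !child.getNodeValue().trim().isEmpty()) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: one nested element, one direct text node.
        String xml = "<root>\n<item>Apples</item>\naaaaaaaaaaaaaa\n</root>";
        System.out.println("\"" + directText(xml) + "\"");
    }
}
```

Because only the root's own children are inspected, the text of nested elements such as Apples never reaches the result, which is the difference from getTextContent() in the question.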

Source: 2011-09-03 12:36:58