Jsoup - Extract text

I need to extract text from a node like this:

<div> 
    Some text <b>with tags</b> might go here. 
    <p>Also there are paragraphs</p> 
    More text can go without paragraphs<br/> 
</div>

And I need to build:

Some text <b>with tags</b> might go here. 
Also there are paragraphs 
More text can go without paragraphs

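Element.text() returns all of the div's content as plain text. Element.ownText() returns only the text that is not inside child elements. Both are wrong for this case. Iterating through children() ignores text nodes.

Is there a way to iterate over an element's contents so that text nodes are included as well? For example:

Text node - Some text
Node <b> - with tags
Text node - might go here.
Node <p> - Also there are paragraphs
Text node - More text can go without paragraphs
Node <br> - <empty>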
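
To make the problem concrete, here is a minimal sketch that runs the three accessors mentioned above on the sample div. It assumes jsoup is on the classpath; the class name `TextAccessDemo` and the helper `sampleDiv()` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TextAccessDemo {
    // Sample markup from the question, condensed into one string.
    static final String HTML =
        "<div> Some text <b>with tags</b> might go here."
        + " <p>Also there are paragraphs</p>"
        + " More text can go without paragraphs<br/></div>";

    // Hypothetical helper: parse the sample and return its div.
    static Element sampleDiv() {
        return Jsoup.parse(HTML).select("div").first();
    }

    public static void main(String[] args) {
        Element div = sampleDiv();
        // text(): all text in the subtree, with every tag stripped.
        System.out.println("text():     " + div.text());
        // ownText(): only text that is a direct child of the div.
        System.out.println("ownText():  " + div.ownText());
        // children(): child elements only (b, p, br); text nodes are skipped.
        System.out.println("children(): " + div.children().size());
    }
}
```

None of the three gives the mixed sequence of text and elements listed above, which is why a node-level accessor is needed.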

2012-04-16 Eugene Retunsky

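Element.children() returns an Elements object, which is a list of Element objects. If you look at the parent class, Node, you will see methods that give you access to arbitrary nodes, not just elements, such as Node.childNodes().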

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ChildNodesDemo {
    public static void main(String[] args) {
        String str = "<div>" +
            " Some text <b>with tags</b> might go here." +
            " <p>Also there are paragraphs</p>" +
            " More text can go without paragraphs<br/>" +
            "</div>";

        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;

        // childNodes() returns text nodes as well as elements, in document order
        for (Node node : div.childNodes()) {
            i++;
            System.out.println(String.format("%d %s %s",
                i,
                node.getClass().getSimpleName(),
                node.toString()));
        }
    }
}

Result:

 
1 TextNode 
Some text 
2 Element <b>with tags</b> 
3 TextNode might go here. 
4 Element <p>Also there are paragraphs</p> 
5 TextNode More text can go without paragraphs 
6 Element <br/>
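
Building on the childNodes() loop, the following is one possible sketch of how the output requested in the question could be assembled. It is not part of the original answer. The class `InlineTextExtractor` and its `extract` method are hypothetical, and the rules it applies (keep inline elements such as `<b>` as markup, put block elements such as `<p>` on their own line, treat `<br>` as a line break) are assumptions read off the sample.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class InlineTextExtractor {
    static final String HTML =
        "<div> Some text <b>with tags</b> might go here."
        + " <p>Also there are paragraphs</p>"
        + " More text can go without paragraphs<br/></div>";

    // Hypothetical helper: one output line per run of inline content or block child.
    static String extract(Element parent) {
        StringBuilder raw = new StringBuilder();
        for (Node node : parent.childNodes()) {
            if (node instanceof TextNode) {
                // text() returns the node's text with whitespace normalised
                raw.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if ("br".equals(el.tagName())) {
                    raw.append('\n');                    // <br> ends the current line
                } else if (el.isBlock()) {
                    raw.append('\n').append(el.text()).append('\n'); // block: own line, tags dropped
                } else {
                    raw.append(el.outerHtml());          // inline: keep the markup
                }
            }
        }
        // Trim each line and drop empty ones.
        StringBuilder result = new StringBuilder();
        for (String line : raw.toString().split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            if (result.length() > 0) result.append('\n');
            result.append(trimmed);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Element div = Jsoup.parse(HTML).select("div").first();
        System.out.println(extract(div));
    }
}
```

The sketch only looks at direct children of the div; nested block elements inside a `<p>` would be flattened by `el.text()`.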

2012-04-16 20:45:27

Works perfectly, thanks! – 2012-04-16 20:49:47

for (Element el : doc.select("body").select("*")) {
    // textNodes() returns only the text nodes that are direct children of el
    for (TextNode node : el.textNodes()) {
        System.out.println(node.text());
    }
}

2013-08-13 21:10:25 Charles

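Assuming you want plain text only (no tags), my solution is below.
The output is:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs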

public static void main(String[] args) throws IOException { 
    String str = 
       "<div>" 
      + " Some text <b>with tags</b> might go here." 
      + " <p>Also there are paragraphs.</p>" 
      + " More text can go without paragraphs<br/>" 
      + "</div>"; 

    Document doc = Jsoup.parse(str); 
    Element div = doc.select("div").first(); 
    StringBuilder builder = new StringBuilder(); 
    stripTags(builder, div.childNodes()); 
    System.out.println("Text without tags: " + builder.toString()); 
} 

/** 
* Strip tags from a List of type <code>Node</code> 
* @param builder StringBuilder : input and output 
* @param nodesList List of type <code>Node</code> 
*/ 
public static void stripTags (StringBuilder builder, List<Node> nodesList) { 

    for (Node node : nodesList) { 
     String nodeName = node.nodeName(); 

     if (nodeName.equalsIgnoreCase("#text")) { 
      builder.append(node.toString()); 
     } else { 
      // recurse 
      stripTags(builder, node.childNodes()); 
     } 
    } 
}
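
As an alternative to the hand-written recursion, jsoup can do the tree walk itself through `Element.traverse(NodeVisitor)`. The sketch below is not from the original answer; the class `TextCollector` and the method `collectText` are illustrative names. It uses `TextNode.text()` rather than `node.toString()`, on the assumption that unescaped text is wanted (`toString()` returns the node's HTML, so a character such as `&` comes back as an entity).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class TextCollector {
    // Hypothetical helper: concatenate every text node under root, in document order.
    static String collectText(Element root) {
        final StringBuilder builder = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when a node is first reached; only text nodes contribute.
                if (node instanceof TextNode) {
                    builder.append(((TextNode) node).text());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return builder.toString();
    }

    public static void main(String[] args) {
        String html = "<div> Some text <b>with tags</b> might go here."
            + " <p>Also there are paragraphs.</p>"
            + " More text can go without paragraphs<br/></div>";
        Element div = Jsoup.parse(html).select("div").first();
        System.out.println("Text without tags: " + collectText(div));
    }
}
```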

2014-12-16 20:21:27

You can use TextNode for this purpose:

List<TextNode> bodyTextNode = doc.getElementById("content").textNodes();
String html = "";
for (TextNode txNode : bodyTextNode) {
    html += txNode.text();
}
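
One thing to keep in mind with this approach: `textNodes()` returns only the text nodes that are direct children of the element, so text inside nested tags is left out. The sketch below illustrates this on the sample from the question; the class `DirectTextNodes`, the method `directText`, and the `id="content"` attribute added to the div are assumptions for the example.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class DirectTextNodes {
    // Hypothetical helper: concatenate the direct text-node children of the element with the given id.
    static String directText(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element el = doc.getElementById(id);
        if (el == null) {
            return ""; // getElementById returns null when no element has that id
        }
        List<TextNode> nodes = el.textNodes();
        StringBuilder sb = new StringBuilder();
        for (TextNode node : nodes) {
            sb.append(node.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"> Some text <b>with tags</b> might go here."
            + " <p>Also there are paragraphs</p>"
            + " More text can go without paragraphs<br/></div>";
        // The text inside <b> and <p> does not appear in the result.
        System.out.println(directText(html, "content"));
    }
}
```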

2015-07-21 18:41:40
