Extracting tags from an HTML file using Jsoup

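I am doing structural analysis of web documents. For this, I need to extract only the structure of a web document (the tags alone). I found a Java HTML parser called Jsoup, but I don't know how to use it to extract the tags.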

Example:

<html> 
<head> 
    this is head 
</head> 
<body> 
    this is body 
</body> 
</html>

Output:

html,head,head,body,body,html

Source

2014-09-19 vignesh babu

for (Element el : doc.select("*")) { System.out.println(el.nodeName()); } already gives the parse result: html, head, body. If the document is well-formed, it is obvious that the tags come in pairs. – 2014-09-26 08:39:42

This sounds like a depth-first traversal:

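import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDepthFirst {

    // Returns the opening and closing tag names of the document, in order.
    private static String htmlTags(Document doc) {
        StringBuilder sb = new StringBuilder();
        htmlTags(doc.children(), sb);
        return sb.toString();
    }

    // Depth-first: emit the name on entry, recurse into the children,
    // then emit the name again on exit.
    private static void htmlTags(Elements elements, StringBuilder sb) {
        for (Element el : elements) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(el.nodeName());
            htmlTags(el.children(), sb);
            sb.append(",").append(el.nodeName());
        }
    }

    public static void main(String... args) {
        String s = "<html><head>this is head </head><body>this is body</body></html>";
        Document doc = Jsoup.parse(s);
        System.out.println(htmlTags(doc));
    }
}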

Another solution is to use jsoup's NodeVisitor, as follows:

SecondSolution ss = new SecondSolution();
doc.traverse(ss);
System.out.println(ss.sb.toString());

The class:

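// Needs org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.nodes.Node and org.jsoup.select.NodeVisitor.
public static class SecondSolution implements NodeVisitor {

    StringBuilder sb = new StringBuilder();

    // Called when a node is first reached: record the opening tag.
    @Override
    public void head(Node node, int depth) {
        // Skip text nodes and the Document root itself.
        if (node instanceof Element && !(node instanceof Document)) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(node.nodeName());
        }
    }

    // Called after all of the node's descendants: record the closing tag.
    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            sb.append(",").append(node.nodeName());
        }
    }
}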

Source

2014-09-19 08:42:15 user1121883

I used the second one. Works fine!! Thanks @user1121883 – 2014-09-19 09:17:09
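
The head/tail callbacks used in the NodeVisitor answer follow the same pattern as the start/end element events of a SAX parser. As a rough illustration only, here is a sketch that uses the JDK's built-in SAX parser instead of jsoup. It is a swapped-in technique, not part of either answer, and it assumes the input is well-formed XML/XHTML; unlike jsoup, it throws on real-world tag soup. The class name SaxTagSequence and the helper tagSequence are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.StringJoiner;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of jsoup or of the answers above.
final class SaxTagSequence {

    // Returns the tag names in open/close order, joined by commas.
    // Assumes well-formed XML input; anything else raises IllegalArgumentException.
    public static String tagSequence(String xhtml) {
        StringJoiner tags = new StringJoiner(",");
        DefaultHandler handler = new DefaultHandler() {
            // Start event: plays the role of NodeVisitor.head for elements.
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                tags.add(qName);
            }

            // End event: plays the role of NodeVisitor.tail for elements.
            @Override
            public void endElement(String uri, String localName, String qName) {
                tags.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return tags.toString();
    }

    public static void main(String[] args) {
        String s = "<html><head>this is head </head><body>this is body</body></html>";
        System.out.println(tagSequence(s)); // prints html,head,head,body,body,html
    }
}
```

Text content never reaches the output because only element events are recorded, which matches the "tags only" requirement in the question. For arbitrary HTML from the web, the jsoup solutions above remain the more forgiving choice.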
