2014-09-19 90 views
1

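Extract tags from an HTML file using Jsoup

I am doing structural analysis of web documents. For this, I need to extract only the structure of a web document (just the tags). I found an HTML parser for Java called Jsoup, but I don't know how to use it to extract the tags.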

Example:

<html> 
<head> 
    this is head 
</head> 
<body> 
    this is body 
</body> 
</html> 

Output:

html,head,head,body,body,html 

for (Element el : doc.select("*")) { System.out.println(el.nodeName()); } already gives you the parsed result: html, head, body. If the document is well-formed, it is obvious that the tags come in pairs. – 2014-09-26 08:39:42

Answers

2

Sounds like a depth-first traversal:

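import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDepthFirst {

    // Returns the tag names in document order, each once on open and once on close.
    private static String htmlTags(Document doc) {
        StringBuilder sb = new StringBuilder();
        htmlTags(doc.children(), sb);
        return sb.toString();
    }

    private static void htmlTags(Elements elements, StringBuilder sb) {
        for (Element el : elements) {
            // Separate entries with a comma, except before the very first one.
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(el.nodeName());                 // opening tag
            htmlTags(el.children(), sb);              // descend into child elements
            sb.append(",").append(el.nodeName());     // closing tag
        }
    }

    public static void main(String... args) {
        String s = "<html><head>this is head </head><body>this is body</body></html>";
        Document doc = Jsoup.parse(s);
        System.out.println(htmlTags(doc));
    }
}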
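
To see why the recursion yields paired names, the same enter/exit pattern can be sketched without jsoup. This is only a minimal illustration, assuming a hypothetical `TagNode` class as a stand-in for jsoup's `Element`; `TagSequenceSketch`, `TagNode`, `collect`, `tagSequence` and `sampleTree` are made-up names, not part of jsoup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parsed document tree; not part of jsoup.
final class TagSequenceSketch {

    // Minimal element: a tag name plus child elements (text nodes are ignored).
    static final class TagNode {
        final String name;
        final List<TagNode> children;

        TagNode(String name, TagNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Depth-first walk: record the name on entry, recurse, record it again on exit.
    static void collect(TagNode node, List<String> out) {
        out.add(node.name);
        for (TagNode child : node.children) {
            collect(child, out);
        }
        out.add(node.name);
    }

    // Joins the collected names with commas.
    static String tagSequence(TagNode root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return String.join(",", out);
    }

    // Same shape as the example document in the question.
    static TagNode sampleTree() {
        return new TagNode("html", new TagNode("head"), new TagNode("body"));
    }

    public static void main(String[] args) {
        System.out.println(tagSequence(sampleTree()));
    }
}
```

Collecting the names into a list and joining them at the end avoids the `sb.length() > 0` check for the leading comma; otherwise the logic mirrors the recursive method above.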

Another solution is to use jsoup's NodeVisitor, as follows:

SecondSolution ss = new SecondSolution();
doc.traverse(ss);
System.out.println(ss.sb.toString());

The class:

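// Needs org.jsoup.nodes.Document, Element, Node and org.jsoup.select.NodeVisitor.
public static class SecondSolution implements NodeVisitor {

    StringBuilder sb = new StringBuilder();

    // Called when a node is first reached: record the opening tag.
    @Override
    public void head(Node node, int depth) {
        // Skip text nodes and the Document root itself.
        if (node instanceof Element && !(node instanceof Document)) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(node.nodeName());
        }
    }

    // Called after all of the node's children were visited: record the closing tag.
    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            sb.append(",").append(node.nodeName());
        }
    }
}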

I used the second one. It works fine!! Thanks @user1121883 – 2014-09-19 09:17:09