2016-07-28 143 views

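Here is my code for splitting URLs, but it has a problem. Every link comes out with a duplicated segment, for example www.utem.edu.my/portal/portal: the segment /portal/ always appears twice in every link. Any suggestions for extracting the links from a web page? How do I split a URL?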

public String crawlURL(String strUrl) { 
    String results = ""; // For return 
    String protocol = "http://"; 

    // Assigns the input to the inURL variable and checks to add http 
    String inURL = strUrl; 
    if (!inURL.toLowerCase().contains("http://".toLowerCase()) && 
      !inURL.toLowerCase().contains("https://".toLowerCase())) { 
     inURL = protocol + inURL; 
    } 

    // Pulls URL contents from the web 
    String contectURL = pullURL(inURL); 
    if (contectURL == "") { // If it fails, then try with https 
     protocol = "https://"; 
     inURL = protocol + inURL.split("http://")[1]; 
     contectURL = pullURL(inURL); 
    } 

    // Declares some variables to be used inside loop 
    String aTagAttr = ""; 
    String href = ""; 
    String msg = ""; 

    // Finds A tag and stores its href value into output var 
    String bodyTag = contectURL.split("<body")[1]; // Find 1st <body> 
    String[] aTags = bodyTag.split(">"); // Splits on every tag 

    //To show link different from one another 
    int index = 0; 

    for (String s: aTags) { 
    // Process only if it is A tag and contains href 
    if (s.toLowerCase().contains("<a") && s.toLowerCase().contains("href")) { 

     aTagAttr = s.split("href")[1]; // Split on href 

     // Split on space if it contains it 
     if (aTagAttr.toLowerCase().contains("\\s")) 
      aTagAttr = aTagAttr.split("\\s")[2]; 

     // Splits on the link and deals with " or ' quotes 
     href = aTagAttr.split(((aTagAttr.toLowerCase().contains("\""))? "\"" : "\'"))[1]; 

     if (!results.toLowerCase().contains(href)) 
      //results += "~~~ " + href + "\r\n"; 

     /* 
     * Last touches to URl before display 
     *  Adds http(s):// if not exist 
     *  Adds base url if not exist 
     */ 

     if(results.toLowerCase().indexOf("about") != -1) { 
      //Contains 'about' 
     } 
     if (!href.toLowerCase().contains("http://") && !href.toLowerCase().contains("https://")) { 

      // http:// + baseURL + href 
      if (!href.toLowerCase().contains(inURL.split("://")[1])) 
       href = protocol + inURL.split("://")[1] + href; 
      else 
       href = protocol + href; 
     } 

     System.out.println(href);//debug 

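You have 'if (!results.toLowerCase().contains(href)) //results += "~~~ " + href + "\r\n";'. This causes a bug: because the statement after the if is commented out, the if does not simply do nothing; it now applies to a different part of the code. – martijnn2008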
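A minimal sketch of the effect described in this comment, reduced to a toy method. The class name DanglingIfDemo, the method run, and the logged strings are hypothetical and only mirror the shape of the code in the question: an if without braces whose only statement has been commented out.

```java
class DanglingIfDemo {
    // Hypothetical reduction of the pattern in the question: the statement
    // that used to follow the if is commented out, so the if now guards
    // the next real statement instead.
    static String run(boolean condition) {
        StringBuilder log = new StringBuilder();
        if (condition)
            // log.append("guarded;");  // commented out, as in the question
        log.append("next;");            // this has become the body of the if
        log.append("after;");           // this always runs
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(true));   // next;after;
        System.out.println(run(false));  // after;
    }
}
```

Putting braces around the body of every if, even an empty one, keeps a commented-out line from silently changing which statement the condition controls.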

Answers

Consider using the URL class...

Use it the way the documentation suggests :)

import java.net.URL;

public class ParseURL {

    public static void main(String[] args) throws Exception {

        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                + "/index.html?name=networking#DOWNLOADING");

        // Print each component that the URL class parsed out of the address
        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("authority = " + aURL.getAuthority());
        System.out.println("host = " + aURL.getHost());
        System.out.println("port = " + aURL.getPort());
        System.out.println("path = " + aURL.getPath());
        System.out.println("query = " + aURL.getQuery());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("ref = " + aURL.getRef());
    }
}

Output:

protocol = http
authority = example.com:80
host = example.com
port = 80

After that, you can take the elements you need and build a new string/URL from them :)
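As a rough sketch of that last step, the snippet below rebuilds only the "protocol://authority" part of a page address, so that a root-relative href can be appended without repeating the page's own path. The class name BaseUrlSketch, the helper siteRoot, and the sample paths under www.utem.edu.my are assumptions for illustration; they are not taken from the asker's site.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class BaseUrlSketch {
    // Hypothetical helper: keeps the protocol and authority of a page
    // address and drops its path, query and fragment.
    static String siteRoot(String pageAddress) {
        try {
            URL page = URI.create(pageAddress).toURL();
            return page.getProtocol() + "://" + page.getAuthority();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + pageAddress, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample values; the real page and href may look different
        String root = siteRoot("http://www.utem.edu.my/portal/index.php?x=1");
        String href = "/portal/about.html";
        System.out.println(root + href);
    }
}
```

This only covers hrefs that start with a slash. Hrefs such as "about.html" or "../x" need proper relative resolution, which plain concatenation does not provide.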

Thanks. :) Any suggestions about this line: href = protocol + inURL.split("://")[1] + href; I think this is the part that doubles the links. Please help me – Jenna
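One possible explanation for the doubling, sketched under the assumption that the crawled address already ends in /portal and the hrefs on the page start with /portal/ as well: the quoted line glues the whole page address, path included, in front of the href. The class ResolveHrefSketch, both method names, and the sample href are hypothetical; java.net.URI.resolve is used here as one standard way to resolve an href against the page address.

```java
import java.net.URI;

class ResolveHrefSketch {
    // Mirrors the concatenation from the question: everything after "://"
    // in the page address, including its path, is put in front of the href.
    static String concatenated(String pageAddress, String href) {
        return "http://" + pageAddress.split("://")[1] + href;
    }

    // Resolves the href against the page address, so a root-relative href
    // replaces the page's path instead of being appended to it.
    static String resolved(String pageAddress, String href) {
        return URI.create(pageAddress).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed sample values, not taken from the real site
        String page = "http://www.utem.edu.my/portal";
        String href = "/portal/index.php";
        System.out.println(concatenated(page, href)); // path segment repeated
        System.out.println(resolved(page, href));     // path segment once
    }
}
```

If the real hrefs are relative without a leading slash, the page address should end with a slash (or name a file) for the resolution to land in the expected directory.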