2017-07-04 74 views
0

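I am following a tutorial to build a web crawler in Java. When I run the code, my crawledURL is null: the program prints "*** Malformed URL :null" in an infinite loop. Why is my crawledURL null?

Can anyone explain why this happens?

Here is the entire code: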

import java.util.*; 
import java.util.regex.Matcher; 
import java.util.regex.Pattern; 
import java.io.*; 
import java.net.*; 

public class WebCrawler { 

public static Queue<String> Queue = new LinkedList<>(); 
public static Set<String> marked = new HashSet<>(); 
public static String regex = "http[s]://(\\w+\\.)*(\\w+)"; 

public static void bfsAlgorithm(String root) throws IOException { 

    Queue.add(root); 

    while (!Queue.isEmpty()) { 

     String crawledURL = Queue.poll(); 
     System.out.println("\n=== Site crawled : " + crawledURL + "==="); 

     //Limiting to a 100 websites here 

     if(marked.size() > 100) 
      return; 

     boolean ok = false; 
     URL url = null; 
     BufferedReader br = null; 

     while (!ok) { 
      try { 
       url = new URL(crawledURL); 
       br = new BufferedReader(new InputStreamReader(url.openStream())); 
       ok = true; 

      } catch (MalformedURLException e) { 
       System.out.println("*** Malformed URL :" + crawledURL); 
       crawledURL = Queue.poll(); 
       ok = false; 

      } catch (IOException ioe) { 
       System.out.println("*** IOException for URL :" + crawledURL); 
       crawledURL = Queue.poll(); 
       ok = false; 


     } 

    } 

     StringBuilder sb = new StringBuilder(); 

     while((crawledURL = br.readLine()) != null) { 
      sb.append(crawledURL); 
     } 

     crawledURL = sb.toString(); 
     Pattern pattern = Pattern.compile(regex); 
     Matcher matcher = pattern.matcher(crawledURL); 


     while (matcher.find()){ 

      String w = matcher.group(); 

      if (!marked.contains(w)) { 
       marked.add(w); 
       System.out.println("Site added for crawling : " + w); 
       Queue.add(w); 
      } 
     } 

    } 

} 


public static void showResults() { 
    System.out.println("\n\nResults :"); 
    System.out.print("Web sites craweled: " + marked.size() + "\n"); 

    for (String s : marked) { 
     System.out.println("* " + s); 
    } 

} 

public static void main(String[] args) { 

    try { 

     bfsAlgorithm("http://www.ssaurel.com/blog"); 
     showResults(); 

    } catch (IOException e) { 

     //TODO Auto-generated catch block 
     e.printStackTrace(); 
    } 
} 

}

+0

https://docs.oracle.com/javase/7/docs/api/java/util/Queue.html#poll() – Lemonov

+0

Thanks, but why is my queue empty? It works in the tutorial. –

Answer

0
while (!Queue.isEmpty()) { 
    String crawledURL = Queue.poll(); 
... 
     } catch (MalformedURLException e) { 
      crawledURL = Queue.poll(); 

You do not check whether the queue is empty before the second poll(), the one inside the catch block.
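
To make the point above concrete, here is a minimal sketch (the class and method names are made up for illustration and are not part of the question's code). It shows the two facts that combine into the endless loop: poll() on an empty queue returns null instead of throwing, and passing that null to new URL(...) raises MalformedURLException, which sends the code straight back into the same catch block.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedList;
import java.util.Queue;

public class EmptyQueuePollDemo {

    // Adds one URL, removes it, then polls again on the now-empty queue.
    static String pollAfterDrain() {
        Queue<String> queue = new LinkedList<>();
        queue.add("http://www.ssaurel.com/blog");
        queue.poll();        // takes out the only element
        return queue.poll(); // queue is empty: poll() returns null, no exception
    }

    // Reports whether new URL(spec) rejects the given string.
    static boolean isMalformed(String spec) {
        try {
            new URL(spec);
            return false;
        } catch (MalformedURLException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String next = pollAfterDrain();
        System.out.println("second poll() returned: " + next);
        System.out.println("rejected by new URL(...): " + isMalformed(next));
    }
}
```

In the question's code the first request presumably fails, so the catch block polls the already-empty queue, gets null, and the inner while (!ok) loop can never finish.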

+0

Thanks. I followed every detail of the tutorial and the video strictly. That is why I am trying to understand why I get this error. –

+0

It also prints '=== Site crawled : http://www.ssaurel.com/blog===' before going into the "*** Malformed URL :null" loop. –

+1

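For example, on the first pass through the outer while() the queue holds one URL; you extract it, and now the queue is empty. Further down you get an error for this URL, retrieve the next element from the empty queue (null!) and try to use it. You need an additional condition in the inner loop: (!ok && crawledURL != null) – rustot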
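
One way to apply this advice is sketched below. It is only an illustration, not the tutorial's code: the PageFetcher interface, the crawl method and the limit parameter are hypothetical names introduced here so the loop can be run without network access. The loop polls exactly once per iteration and treats null as "queue exhausted", so a failed URL is skipped instead of being replaced by a blind second poll(). The regex is copied unchanged from the question.

```java
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuardedCrawlLoop {

    // Same pattern as in the question.
    static final Pattern LINK = Pattern.compile("http[s]://(\\w+\\.)*(\\w+)");

    // Hypothetical stand-in for url.openStream(): returns the page text or throws.
    interface PageFetcher {
        String fetch(String url) throws IOException;
    }

    static Set<String> crawl(String root, PageFetcher fetcher, int limit) {
        Queue<String> queue = new LinkedList<>();
        Set<String> marked = new LinkedHashSet<>();
        queue.add(root);

        String url;
        // Single poll() per iteration; a null result ends the loop cleanly.
        while ((url = queue.poll()) != null && marked.size() < limit) {
            String page;
            try {
                page = fetcher.fetch(url);
            } catch (IOException e) {
                System.out.println("*** IOException for URL :" + url);
                continue; // skip this URL; the loop condition re-checks the queue
            }
            Matcher matcher = LINK.matcher(page);
            while (matcher.find()) {
                String w = matcher.group();
                if (marked.add(w)) { // add() returns false if already marked
                    queue.add(w);
                }
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        // Stub fetcher: every request fails, as the first one did in the question.
        Set<String> found = crawl("http://www.ssaurel.com/blog",
                u -> { throw new IOException("simulated failure"); }, 100);
        System.out.println("Sites found: " + found.size());
    }
}
```

If you keep the original structure and only add crawledURL != null to the inner while condition, note that br can still be null after that loop, so the later br.readLine() call would also need a null check.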