Why is my crawledURL null?

I am following a tutorial on building a web crawler in Java. When I run the code, my crawledURL is null: the program prints "*** Malformed URL :null" in an infinite loop.

Can anyone explain why this happens?

Here is the complete code:

import java.util.*; 
import java.util.regex.Matcher; 
import java.util.regex.Pattern; 
import java.io.*; 
import java.net.*; 

public class WebCrawler { 

public static Queue<String> Queue = new LinkedList<>(); 
public static Set<String> marked = new HashSet<>(); 
public static String regex = "http[s]://(\\w+\\.)*(\\w+)"; 

public static void bfsAlgorithm(String root) throws IOException { 

    Queue.add(root); 

    while (!Queue.isEmpty()) { 

     String crawledURL = Queue.poll(); 
     System.out.println("\n=== Site crawled : " + crawledURL + "==="); 

     //Limiting to a 100 websites here 

     if(marked.size() > 100) 
      return; 

     boolean ok = false; 
     URL url = null; 
     BufferedReader br = null; 

     while (!ok) { 
      try { 
       url = new URL(crawledURL); 
       br = new BufferedReader(new InputStreamReader(url.openStream())); 
       ok = true; 

      } catch (MalformedURLException e) { 
       System.out.println("*** Malformed URL :" + crawledURL); 
       crawledURL = Queue.poll(); 
       ok = false; 

      } catch (IOException ioe) { 
       System.out.println("*** IOException for URL :" + crawledURL); 
       crawledURL = Queue.poll(); 
       ok = false; 


     } 

    } 

     StringBuilder sb = new StringBuilder(); 

     while((crawledURL = br.readLine()) != null) { 
      sb.append(crawledURL); 
     } 

     crawledURL = sb.toString(); 
     Pattern pattern = Pattern.compile(regex); 
     Matcher matcher = pattern.matcher(crawledURL); 


     while (matcher.find()){ 

      String w = matcher.group(); 

      if (!marked.contains(w)) { 
       marked.add(w); 
       System.out.println("Site added for crawling : " + w); 
       Queue.add(w); 
      } 
     } 

    } 

} 


public static void showResults() { 
    System.out.println("\n\nResults :"); 
    System.out.print("Web sites craweled: " + marked.size() + "\n"); 

    for (String s : marked) { 
     System.out.println("* " + s); 
    } 

} 

public static void main(String[] args) { 

    try { 

     bfsAlgorithm("http://www.ssaurel.com/blog"); 
     showResults(); 

    } catch (IOException e) { 

     //TODO Auto-generated catch block 
     e.printStackTrace(); 
    } 
}

}

Source

2017-07-04 G. Doe

https://docs.oracle.com/javase/7/docs/api/java/util/Queue.html#poll() – Lemonov
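To make the linked documentation concrete, here is a minimal sketch of what `poll()` does on an empty queue. The class name `PollDemo` and the helper `secondPoll` are made up for illustration and are not part of the question's code.

```java
import java.util.LinkedList;
import java.util.Queue;

class PollDemo {

    /** Adds one element, polls twice, and returns the result of the second poll. */
    static String secondPoll() {
        Queue<String> queue = new LinkedList<>();
        queue.add("http://www.ssaurel.com/blog");
        queue.poll();          // removes and returns the only element
        return queue.poll();   // the queue is now empty, so poll() returns null
    }

    public static void main(String[] args) {
        System.out.println(secondPoll()); // prints "null"
    }
}
```

Unlike `remove()`, which throws `NoSuchElementException` on an empty queue, `poll()` signals emptiness only through its `null` return value, so the caller has to check for it.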

Thanks, but why is my queue empty? It works in the tutorial.

while (!Queue.isEmpty()) { 
    String crawledURL = Queue.poll(); 
... 
     } catch (MalformedURLException e) { 
      crawledURL = Queue.poll();

You do not check whether the queue is empty before the second poll().

Source

2017-07-04 07:04:52 rustot

Thanks. I followed every detail of the tutorial and the video exactly, which is why I am trying to understand why this error gets caught.

I also get "=== Site crawled : http://www.ssaurel.com/blog===" printed before it goes into the "*** Malformed URL :null" loop.

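For example, on the first pass through while() the queue holds one URL. You poll it, and the queue is now empty. Further down you get an error for that URL, retrieve the next element from the empty queue (null!) and try to use it. You need an extra condition: (!ok && crawledURL != null) – rustot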
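The retry logic described in this comment could be sketched as follows. This is only one possible shape for the fix, not the tutorial's code: the class `SafePoll` and the method `openNext` are hypothetical names, and the example feeds the queue two deliberately malformed strings so that it runs without network access. It uses the same `new URL(String)` constructor as the question (deprecated in recent JDKs, but still available).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedList;
import java.util.Queue;

class SafePoll {

    /**
     * Polls the queue until one URL can be opened.
     * Returns a reader for that URL, or null once the queue is exhausted.
     */
    static BufferedReader openNext(Queue<String> queue) {
        String candidate = queue.poll();
        while (candidate != null) { // null means the queue is empty: stop retrying
            try {
                URL url = new URL(candidate);
                return new BufferedReader(new InputStreamReader(url.openStream()));
            } catch (MalformedURLException e) {
                System.out.println("*** Malformed URL :" + candidate);
            } catch (IOException e) {
                System.out.println("*** IOException for URL :" + candidate);
            }
            candidate = queue.poll(); // try the next queued URL, if any
        }
        return null;
    }

    public static void main(String[] args) {
        Queue<String> queue = new LinkedList<>();
        queue.add("not a url");  // no protocol
        queue.add("foo://bar");  // unknown protocol
        BufferedReader br = openNext(queue);
        System.out.println(br == null ? "queue exhausted" : "opened");
    }
}
```

In the question's `bfsAlgorithm`, the caller would then have to stop (for example with `return`) when the helper yields null, instead of going on to call `br.readLine()`.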
