2016-05-12

Why does the crawler4j code below crawl only the given seed URL and never follow links to other pages? crawler4j crawls only the seed URL.

public static void main(String[] args) {
    String crawlStorageFolder = "F:\\crawl";
    int numberOfCrawlers = 7;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorageFolder);
    config.setMaxDepthOfCrawling(4);

    /*
     * Instantiate the controller for this crawl.
     */
    PageFetcher pageFetcher = new PageFetcher(config);

    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);

    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = null;
    try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
    } catch (Exception e) {
        e.printStackTrace();
    }

    /*
     * For each crawl, you need to add some seed URLs. These are the first
     * URLs that are fetched; the crawler then starts following links
     * found in these pages.
     */
    controller.addSeed("http://edition.cnn.com/2016/05/11/politics/paul-ryan-donald-trump-meeting/index.html");

    /*
     * Start the crawl. This is a blocking operation, meaning that your code
     * will reach the line after this only when crawling is finished.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);
}

Can you post the code of the `shouldVisit` method in MyCrawler.class? – rzo


Thanks, my bad. In the example, `shouldVisit` restricts crawling to their site's domain as a "must have" — I missed that. Now everything works :) – user1025852


That's what I suspected :) – rzo

Answer


The official example is restricted to the www.ics.uci.edu domain, so the `shouldVisit` method in your extended crawler class needs to be modified accordingly.

/**
 * You should implement this function to specify whether the given url
 * should be crawled or not (based on your crawling logic).
 */
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Ignore the url if it has an extension that matches our defined set of image extensions.
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
        return false;
    }

    // Only accept the url if it is in the "www.ics.uci.edu" domain and protocol is "http".
    return href.startsWith("http://www.ics.uci.edu/");
}
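
For the question above, the fix is to widen that domain check to match the actual seed (edition.cnn.com), otherwise every discovered link is rejected and only the seed is fetched. The filtering logic itself can be sketched in plain Java without the crawler4j dependency — `CnnUrlFilter`, its `shouldVisit` helper, and the `IMAGE_EXTENSIONS` pattern below are illustrative names modeled on the answer's code, not part of crawler4j:

```java
import java.util.regex.Pattern;

public class CnnUrlFilter {
    // Same idea as the IMAGE_EXTENSIONS pattern used in the answer's code:
    // match URLs ending in a common image file extension.
    private static final Pattern IMAGE_EXTENSIONS =
            Pattern.compile(".*\\.(bmp|gif|jpe?g|png|tiff?)$");

    // Mirrors the shouldVisit logic: skip images, accept only URLs
    // under the seed's own domain.
    public static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        if (IMAGE_EXTENSIONS.matcher(href).matches()) {
            return false;
        }
        return href.startsWith("http://edition.cnn.com/");
    }

    public static void main(String[] args) {
        // Article under the seed domain: accepted.
        System.out.println(shouldVisit("http://edition.cnn.com/2016/05/11/politics/index.html"));
        // Image on the seed domain: rejected by the extension filter.
        System.out.println(shouldVisit("http://edition.cnn.com/logo.png"));
        // Off-domain link: rejected by the startsWith check.
        System.out.println(shouldVisit("http://www.ics.uci.edu/"));
    }
}
```

The same `startsWith` line dropped into the real `shouldVisit(Page, WebURL)` override is what made the asker's crawl follow links beyond the seed.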