2016-03-08

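Passing an object between classes using crawler4j

I have a simple web crawler built from the crawler4j building blocks. I am trying to build a dictionary while my crawler crawls and parses text, and then hand it to my main (controller) class. How can I do this, given that my MyCrawler object is not created in my main class (MyCrawler.class is only passed as the first argument)? Also, I cannot change the controller.start method. I want to be able to use the dictionary that was built in the crawler once the crawl has finished.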

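I think the best way to do this would be to pass a predefined, already constructed MyCrawler object to controller.start, but as far as I can see there is no way to do that.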

Here is my code. Thanks a lot for your help!

Crawler:

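import java.util.ArrayList;
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler 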
{ 
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|zip|gz))$"); 
    public ArrayList<String> dictionary = new ArrayList<String>(); 

    @Override public boolean shouldVisit(Page referringPage, WebURL url) 
    { 
     String href = url.getURL().toLowerCase(); 
     return !FILTERS.matcher(href).matches() 
       && href.startsWith("http://lyle.smu.edu/~fmoore"); 
    } 

    @Override public void visit(Page page) 
    { 
     String url = page.getWebURL().getURL(); 
     System.out.println("URL: " + url); 
     if(page.getParseData() instanceof HtmlParseData) 
     { 
      HtmlParseData h = (HtmlParseData)page.getParseData(); 
      String text = h.getText(); 

      // Split on runs of whitespace (spaces, tabs, newlines) and skip empty tokens
      String[] words = text.split("\\s+"); 
      for(int i = 0;i < words.length;i++) 
      { 
       if(!words[i].isEmpty()) 
        dictionary.add(words[i]); 
      } 

      String html = h.getHtml(); 
      Set<WebURL> links = h.getOutgoingUrls(); 

      System.out.println("Text length: " + text.length()); 
      System.out.println("Html length: " + html.length()); 
      System.out.println("Number of outgoing links: " + links.size()); 
      System.out.println(text); 
     } 
    } 
} 

Controller:

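import java.util.ArrayList;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller 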
{ 
    public ArrayList<String> dictionary = new ArrayList<String>(); 

    public static void main(String[] args) throws Exception 
    { 
     int numberOfCrawlers = 1; 
     String crawlStorageFolder = "/data/crawl/root"; 

     CrawlConfig c = new CrawlConfig(); 
     c.setCrawlStorageFolder(crawlStorageFolder); 
     c.setMaxDepthOfCrawling(-1); //Unlimited Depth 
     c.setMaxPagesToFetch(-1);  //Unlimited Pages 
     c.setPolitenessDelay(200);  //Politeness Delay 

     PageFetcher pf = new PageFetcher(c); 
     RobotstxtConfig robots = new RobotstxtConfig(); 
     RobotstxtServer rs = new RobotstxtServer(robots, pf); 
     CrawlController controller = new CrawlController(c, pf, rs); 

     controller.addSeed("http://lyle.smu.edu/~fmoore"); 

     controller.start(MyCrawler.class, numberOfCrawlers);   

     controller.shutdown(); 
     controller.waitUntilFinish(); 
    } 
} 

Answer

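Let a WebCrawlerFactory create the MyCrawler objects. That should do it (at least since version 4.2). Note, however, that your dictionary has to support concurrent access (a plain ArrayList will not do!), because several crawler threads may add to it at the same time.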

// Use a factory instead of passing the crawler class, so that the
// shared dictionary can be handed to each crawler through its constructor
controller.start(new WebCrawlerFactory<MyCrawler>() { 
    @Override 
    public MyCrawler newInstance() throws Exception { 
     return new MyCrawler(dictionary); 
    } 
}, numberOfCrawlers);
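
To make the concurrency point concrete, here is a minimal, self-contained sketch that runs without crawler4j. All names in it (`SharedDictionarySketch`, `WorkerFactory`, `Worker`, `start`, `crawl`) are hypothetical stand-ins for `WebCrawlerFactory`, `MyCrawler`, `controller.start` and `main`; none of them are part of the library. It only shows the pattern: the caller owns one thread-safe list, the factory passes it to every worker, and the caller reads it after all workers have finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SharedDictionarySketch {

    /** Hypothetical stand-in for crawler4j's WebCrawlerFactory. */
    interface WorkerFactory {
        Worker newInstance() throws Exception;
    }

    /** Hypothetical stand-in for MyCrawler: it receives the shared list via its constructor. */
    static class Worker {
        private final List<String> dictionary;

        Worker(List<String> dictionary) {
            this.dictionary = dictionary;
        }

        /** Plays the role of visit(Page): tokenise the text and record non-empty words. */
        void visit(String text) {
            for (String word : text.split("\\s+")) {
                if (!word.isEmpty()) {
                    dictionary.add(word);
                }
            }
        }
    }

    /** Plays the role of controller.start(factory, n): one thread per worker, blocks until all are done. */
    static void start(WorkerFactory factory, int numberOfWorkers, List<String> pages) throws Exception {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < numberOfWorkers; i++) {
            Worker worker = factory.newInstance();
            Thread thread = new Thread(() -> {
                for (String page : pages) {
                    worker.visit(page);
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
    }

    /** Plays the role of main: owns the dictionary and reads it after the run. */
    static List<String> crawl(int numberOfWorkers, List<String> pages) throws Exception {
        // synchronizedList makes concurrent add() calls from several threads safe;
        // a plain ArrayList could lose entries or throw under the same load.
        List<String> dictionary = Collections.synchronizedList(new ArrayList<>());
        start(() -> new Worker(dictionary), numberOfWorkers, pages);
        return dictionary;
    }

    public static void main(String[] args) throws Exception {
        List<String> dictionary = crawl(4, Arrays.asList("hello  world", " foo\nbar "));
        System.out.println(dictionary.size() + " words collected");
    }
}
```

Applied to the code in the question, this would presumably mean giving MyCrawler a constructor that takes the list, and making the `dictionary` used in the factory visible from the static `main` method, for example as a local variable wrapped with `Collections.synchronizedList`, since the instance field declared in Controller cannot be referenced from a static context.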