I am trying to use the basic crawler example from crawler4j. I took the code from the crawler4j website here. Why does the crawler4j example throw an error?

package edu.crawler; 

import edu.uci.ics.crawler4j.crawler.Page; 
import edu.uci.ics.crawler4j.crawler.WebCrawler; 
import edu.uci.ics.crawler4j.parser.HtmlParseData; 
import edu.uci.ics.crawler4j.url.WebURL; 
import java.util.List; 
import java.util.regex.Pattern; 
import org.apache.http.Header; 

public class MyCrawler extends WebCrawler { 

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4" 
        + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"); 

    /** 
    * You should implement this function to specify whether the given url 
    * should be crawled or not (based on your crawling logic). 
    */ 
    @Override 
    public boolean shouldVisit(WebURL url) { 
      String href = url.getURL().toLowerCase(); 
      return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/"); 
    } 

    /** 
    * This function is called when a page is fetched and ready to be processed 
    * by your program. 
    */ 
    @Override 
    public void visit(Page page) { 
      int docid = page.getWebURL().getDocid(); 
      String url = page.getWebURL().getURL(); 
      String domain = page.getWebURL().getDomain(); 
      String path = page.getWebURL().getPath(); 
      String subDomain = page.getWebURL().getSubDomain(); 
      String parentUrl = page.getWebURL().getParentUrl(); 
      String anchor = page.getWebURL().getAnchor(); 

      System.out.println("Docid: " + docid); 
      System.out.println("URL: " + url); 
      System.out.println("Domain: '" + domain + "'"); 
      System.out.println("Sub-domain: '" + subDomain + "'"); 
      System.out.println("Path: '" + path + "'"); 
      System.out.println("Parent page: " + parentUrl); 
      System.out.println("Anchor text: " + anchor); 

      if (page.getParseData() instanceof HtmlParseData) { 
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData(); 
        String text = htmlParseData.getText(); 
        String html = htmlParseData.getHtml(); 
        List<WebURL> links = htmlParseData.getOutgoingUrls(); 

        System.out.println("Text length: " + text.length()); 
        System.out.println("Html length: " + html.length()); 
        System.out.println("Number of outgoing links: " + links.size()); 
      } 

      Header[] responseHeaders = page.getFetchResponseHeaders(); 
      if (responseHeaders != null) { 
        System.out.println("Response headers:"); 
        for (Header header : responseHeaders) { 
          System.out.println("\t" + header.getName() + ": " + header.getValue()); 
        } 
      } 

      System.out.println("============="); 
    } 
} 

The above is the crawler class from the example.

import edu.uci.ics.crawler4j.crawler.CrawlConfig; 
import edu.uci.ics.crawler4j.crawler.CrawlController; 
import edu.uci.ics.crawler4j.fetcher.PageFetcher; 
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig; 
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer; 

public class Controller { 

    public static void main(String[] args) throws Exception { 
      String crawlStorageFolder = "../data/"; 
      int numberOfCrawlers = 7; 

      CrawlConfig config = new CrawlConfig(); 
      config.setCrawlStorageFolder(crawlStorageFolder); 

      /* 
      * Instantiate the controller for this crawl. 
      */ 
      PageFetcher pageFetcher = new PageFetcher(config); 
      RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
      RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher); 
      CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer); 

      /* 
      * For each crawl, you need to add some seed urls. These are the first 
      * URLs that are fetched and then the crawler starts following links 
      * which are found in these pages 
      */ 
      controller.addSeed("http://www.ics.uci.edu/~welling/"); 
      controller.addSeed("http://www.ics.uci.edu/~lopes/"); 
      controller.addSeed("http://www.ics.uci.edu/"); 

      /* 
      * Start the crawl. This is a blocking operation, meaning that your code 
      * will reach the line after this only when crawling is finished. 
      */ 
      controller.start(MyCrawler.class, numberOfCrawlers); 
    } 
} 
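As an aside, the same CrawlConfig object exposes a few common tuning knobs in the crawler4j 3.x API; a minimal sketch, with values chosen purely for illustration:

// Optional tuning before handing config to the CrawlController: 
config.setPolitenessDelay(1000);    // wait 1 second between requests to the same host 
config.setMaxDepthOfCrawling(2);    // follow links at most two hops from a seed 
config.setMaxPagesToFetch(1000);    // stop the crawl after 1000 fetched pages 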

The above is the controller class for the web crawler. When I try to run the Controller class from my IDE (IntelliJ), I get the following error:

Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/uci/ics/crawler4j/crawler/CrawlConfig : Unsupported major.minor version 51.0 

Is there something about the Maven configuration found here that I should know about? Do I have to use a different version or something?

From the sound of it, you are trying to execute code that was compiled on a later version of Java than the one you are running. E.g. the code was compiled with Java 7 and you are running Java 6, or it was compiled with Java 6 and you are running Java 5... – MadProgrammer 2013-03-14 00:29:45
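To confirm which JRE is actually executing the code, a tiny check like the following can help (an illustrative sketch, not part of the original thread; java.class.version reports the class-file format the running JVM supports, 50.0 for Java 6 and 51.0 for Java 7, matching the number in the error):

public class JavaVersionCheck { 
    public static void main(String[] args) { 
        // The release of the JRE that is executing this code. 
        System.out.println("java.version = " + System.getProperty("java.version")); 
        // The highest class-file version this JVM accepts (50.0 = Java 6, 51.0 = Java 7). 
        System.out.println("java.class.version = " + System.getProperty("java.class.version")); 
    } 
} 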

Check out http://stackoverflow.com/questions/10382929/unsupported-major-minor-version-51-0 – Farlan 2013-03-14 00:34:08

@j.jerrod taylor, I am running into a problem even with a very basic program. I get Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/methods/HttpUriRequest at com.crawler.web.BasicCrawlController.main(BasicCrawlController.java:78), Caused by: java.lang.ClassNotFoundException: org.apache.http.client.methods.HttpUriRequest. Please suggest whether any other jar is also required. – 2013-06-14 16:47:29
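That NoClassDefFoundError means Apache HttpClient, one of crawler4j's dependencies, is not on the classpath. A quick sanity check, as a minimal sketch (not from the original thread; the companion jar names reflect the usual HttpClient distribution and are an assumption about this setup):

public class ClasspathCheck { 
    public static void main(String[] args) { 
        // Try to load the Apache HttpClient class that crawler4j calls into. 
        try { 
            Class.forName("org.apache.http.client.methods.HttpUriRequest"); 
            System.out.println("httpclient is on the classpath."); 
        } catch (ClassNotFoundException e) { 
            // Typically fixed by adding httpclient (plus httpcore and 
            // commons-logging) to the classpath, e.g. via Maven dependencies. 
            System.out.println("httpclient is missing from the classpath."); 
        } 
    } 
} 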

Answer

The problem wasn't with crawler4j. The problem was that the version of Java I was using was different from the version of Java that crawler4j was built with. I switched versions right before the update to Java 7, and everything worked fine. My guess is that upgrading my Java to version 7 would have had the same effect.
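To see concretely which Java release a crawler4j class was compiled for, one can read the class-file header directly (an illustrative sketch, not part of the original answer; it assumes the crawler4j jar is on the classpath). In the class-file format, the four-byte magic number is followed by two-byte minor and major versions, and a major version of 51 means Java 7 while 50 means Java 6:

import java.io.DataInputStream; 
import java.io.InputStream; 

public class ClassVersionProbe { 
    public static void main(String[] args) throws Exception { 
        // Load the raw .class bytes of CrawlConfig from the crawler4j jar. 
        InputStream in = ClassVersionProbe.class 
            .getResourceAsStream("/edu/uci/ics/crawler4j/crawler/CrawlConfig.class"); 
        DataInputStream data = new DataInputStream(in); 
        data.readInt();                        // magic number 0xCAFEBABE 
        int minor = data.readUnsignedShort();  // minor_version 
        int major = data.readUnsignedShort();  // major_version: 51 = Java 7, 50 = Java 6 
        data.close(); 
        System.out.println("CrawlConfig class file version: " + major + "." + minor); 
    } 
} 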

I am supposed to crawl dynamic websites using crawler4j (Java). http://stackoverflow.com/questions/27264931/crawling-dynamic-website-using-java?noredirect=1#comment43002565_27264931 – BasK 2014-12-03 11:42:53