Crawling and extracting information with crawler4j

I need help understanding how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all. I want to visit every port, extract its name and coordinates, and write them to a file. The main class looks like this:

import java.io.FileWriter;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class WorldPortSourceCrawler {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "data";
        int numberOfCrawlers = 5;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setMaxDepthOfCrawling(2);
        config.setUserAgentString("Sorry for any inconvenience, I am trying to keep the traffic low per second");
        //config.setPolitenessDelay(20);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed URLs. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages.
         */
        controller.addSeed("http://www.marinetraffic.com/en/ais/index/ports/all");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(PortExtractor.class, numberOfCrawlers);

        System.out.println("finished reading");
        System.out.println("Ports: " + PortExtractor.portList.size());
        FileWriter writer = new FileWriter("PortInfo2.txt");

        System.out.println("Writing to file...");
        for (Port p : PortExtractor.portList) {
            writer.append(p.print() + "\n");
            writer.flush();
        }
        writer.close();
        System.out.println("File written");
    }
}

And the port extractor looks like this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
    );

    public static List<Port> portList = new ArrayList<Port>();

    /**
     * Crawling logic
     */
    //@Override
    public boolean shouldVisit(WebURL url) {

        String href = url.getURL().toLowerCase();
        //return !FILTERS.matcher(href).matches() && href.startsWith("http://www.worldportsource.com/countries.php") && !href.contains("/shipping/") && !href.contains("/cruising/") && !href.contains("/Today's Port of Call/") && !href.contains("/portcall/") && !href.contains("/localviews/") && !href.contains("/commerce/") && !href.contains("/maps/") && !href.contains("/waterways/");
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }
}

How do I go about writing the HTML parser, and how can I tell the program that it should not crawl anything other than the port-information links? I am having trouble even running the code: it breaks every time I try to add HTML parsing. Any help would be greatly appreciated.

Answer


The first task is to check the site's robots.txt in order to see whether crawler4j is allowed to crawl this website at all. Looking at that file, we find that this poses no problem:

User-agent: * 
Allow:/
Disallow: /mob/ 
Disallow: /upload/ 
Disallow: /users/ 
Disallow: /wiki/ 
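crawler4j honours these rules automatically as long as a RobotstxtServer is wired into the controller, which your main class already does. A minimal sketch of that wiring (the user-agent name "port-crawler" is just a placeholder; use your own identifier):

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
// Robots.txt handling is enabled by default; the user-agent name decides which
// rule group of the remote robots.txt applies to this crawler.
robotstxtConfig.setUserAgentName("port-crawler");
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);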

Second, we need to figure out which links are of particular interest for your purpose. This requires some manual investigation. I only checked a few entries reachable from the link above, but I found that every port page contains the keyword ports in its link, for example:

http://www.marinetraffic.com/en/ais/index/ports/all/per_page:50 
http://www.marinetraffic.com/en/ais/details/ports/18853/China_port:YANGZHOU 
http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN 

With this information, we can modify the shouldVisit method in a whitelist manner:

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {

    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
           && href.contains("www.marinetraffic.com")
           && href.contains("ports");
}

This is a very simplistic implementation, which could be tightened up with regular expressions.
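For instance, a sketch of a stricter whitelist; the exact pattern is an assumption based on the example URLs above, so adjust it if the site uses other URL layouts:

// Accept only the paginated port index and individual port detail pages.
private static final Pattern PORT_PAGES = Pattern.compile(
        "https?://www\\.marinetraffic\\.com/en/ais/(index/ports/all.*|details/ports/.*)");

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches() && PORT_PAGES.matcher(href).matches();
}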

Third, we need to parse the data out of the HTML. The information you are looking for is contained in the following <div> section:

<div class="bg-info bg-light padding-10 radius-4 text-left"> 
    <div> 
     <span>Latitude/Longitude: </span> 
     <b>1.2593655°/103.75445°</b> 
     <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655" title="Show on Map"><img class="loaded" src="/img/icons/show_on_map_magnify.png" data-original="/img/icons/show_on_map_magnify.png" alt="Show on Map" title="Show on Map"></a> 
     <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655/showports:1" title="Show on Map">Show on Map</a> 
    </div> 

    <div> 
     <span>Local Time:</span> 
       <b><time>2016-12-11 19:20</time>&nbsp;[UTC +8]</b> 
    </div> 

      <div> 
      <span>Un/locode: </span> 
      <b>SGSIN</b> 
     </div> 

      <div> 
      <span>Vessels in Port: </span> 
      <b><a href="/en/ais/index/ships/range/port_id:290/port_name:SINGAPORE">1021</a></b> 
     </div> 

      <div> 
      <span>Expected Arrivals: </span> 
      <b><a href="/en/ais/index/eta/all/port:290/portname:SINGAPORE">1059</a></b> 
     </div> 

</div> 

Basically, I would use an HTML parser (e.g. Jericho) for this task. You can then extract exactly the right <div> section and read off the attributes you are looking for.
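A rough sketch of how this could look inside visit(), assuming the Jericho parser is on the classpath; the coordinate regex and the Port(name, lat, lon) constructor are assumptions you will need to adapt to your own Port class:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    if (!(page.getParseData() instanceof HtmlParseData)) {
        return;                                     // skip non-HTML content
    }
    String html = ((HtmlParseData) page.getParseData()).getHtml();

    Source source = new Source(html);
    // Matches the "1.2593655°/103.75445°" format shown in the snippet above.
    Pattern coords = Pattern.compile("(-?\\d+\\.\\d+)°/(-?\\d+\\.\\d+)°");

    for (Element div : source.getAllElements("div")) {
        String cssClass = div.getAttributeValue("class");
        if (cssClass != null && cssClass.contains("bg-info")) {
            Matcher m = coords.matcher(div.getTextExtractor().toString());
            if (m.find()) {
                double lat = Double.parseDouble(m.group(1));
                double lon = Double.parseDouble(m.group(2));
                // Port(name, lat, lon) is an assumed constructor; visit() runs on
                // several crawler threads, hence the synchronized block.
                synchronized (PortExtractor.portList) {
                    PortExtractor.portList.add(new Port(url, lat, lon));
                }
            }
        }
    }
}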

Thanks, the shouldVisit() is better tuned than what I originally had; however, my crawler only seems to visit the seed page and does not go any deeper after that. Any reason why that might be? – Almanz

You set maxDepth to 2. Check your CrawlConfig again, or update your question with the current code. – rzo

I updated the code shown above; as for the crawl depth, I left it unchanged and relied only on the filtering to do the job. However, I still seem to be running into the problem that the crawler only crawls the following URL: http://www.marinetraffic.com/en/ais/index/ports/all – Almanz
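If the depth limit turns out to be the issue, it can be lifted so that only the shouldVisit() filtering bounds the crawl; a one-line sketch, assuming crawler4j's convention that -1 disables the depth limit:

config.setMaxDepthOfCrawling(-1);   // -1 = no depth limit; rely on shouldVisit() filtering instead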