I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all — that is, go through each port, extract its name and coordinates, and write them to a file. I am using crawler4j for the crawling and extraction. The main class looks like this:
import java.io.FileWriter;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class WorldPortSourceCrawler {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "data";
        int numberOfCrawlers = 5;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setMaxDepthOfCrawling(2);
        config.setUserAgentString("Sorry for any inconvenience, I am trying to keep the traffic low per second");
        //config.setPolitenessDelay(20);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed URLs. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages.
         */
        controller.addSeed("http://www.marinetraffic.com/en/ais/index/ports/all");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(PortExtractor.class, numberOfCrawlers);

        System.out.println("finished reading");
        System.out.println("Ports: " + PortExtractor.portList.size());

        FileWriter writer = new FileWriter("PortInfo2.txt");
        System.out.println("Writing to file...");
        for (Port p : PortExtractor.portList) {
            writer.append(p.print() + "\n");
            writer.flush();
        }
        writer.close();
        System.out.println("File written");
    }
}
And the PortExtractor looks like this:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
    );

    public static List<Port> portList = new ArrayList<Port>();

    /**
     * Crawling logic
     */
    //@Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        //return !FILTERS.matcher(href).matches() && href.startsWith("http://www.worldportsource.com/countries.php") && !href.contains("/shipping/") && !href.contains("/cruising/") && !href.contains("/Today's Port of Call/") && !href.contains("/cruising/") && !href.contains("/portcall/") && !href.contains("/localviews/") && !href.contains("/commerce/") && !href.contains("/maps/") && !href.contains("/waterways/");
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }
}
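The Port class itself is not shown here. For reference, a minimal sketch that is consistent with how it is used in main() (a static portList of Port, and p.print() returning one line of text) could look like the following; the field names and the constructor are my own assumptions:

// Minimal sketch of a Port holder. Field names and constructor are
// assumptions based on how portList and p.print() are used in main().
public class Port {
    private final String name;
    private final String latitude;
    private final String longitude;

    public Port(String name, String latitude, String longitude) {
        this.name = name;
        this.latitude = latitude;
        this.longitude = longitude;
    }

    public String print() {
        return name + "\t" + latitude + "\t" + longitude;
    }
}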
How do I go about writing the HTML parser, and how can I tell the program that it should not crawl anything other than the port-information links? I am having trouble even getting the code to run: it breaks every time I try to add HTML parsing. Any help would be greatly appreciated.
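For the HTML-parsing part, one common approach with crawler4j is to take the raw HTML from HtmlParseData inside visit() and hand it to a parser such as Jsoup. The sketch below assumes Jsoup is on the classpath and uses placeholder CSS selectors; the real selectors have to be taken from the actual markup of the port listing page, which I have not inspected. It also assumes the Port(name, lat, lon) constructor from the sketch above.

// Additional imports at the top of PortExtractor:
// import org.jsoup.Jsoup;
// import org.jsoup.nodes.Document;
// import org.jsoup.nodes.Element;
// import edu.uci.ics.crawler4j.parser.HtmlParseData;

@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    System.out.println("URL: " + url);

    // Only HTML pages carry HtmlParseData; binary responses do not.
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        Document doc = Jsoup.parse(htmlParseData.getHtml());

        // Placeholder selectors: one table row per port, the name in a link
        // and the coordinates in their own cells. Adjust these after looking
        // at the real page source in your browser.
        for (Element row : doc.select("table tbody tr")) {
            String name = row.select("td a").text();
            String lat  = row.select("td.latitude").text();
            String lon  = row.select("td.longitude").text();
            if (!name.isEmpty()) {
                portList.add(new Port(name, lat, lon)); // assumed constructor
            }
        }
    }
}

With something like this in place, the writing loop in main() can stay as it is, since the ports are accumulated in portList while the crawl runs.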
Thanks, shouldVisit() is tuned better than what I originally had; but my crawler only seems to visit the seed page and does not go any deeper after that. Any reason why that might be? – Almanz
You set maxDepth to 2. Check your CrawlConfig again, or update your question with your current setup. – rzo
I updated my code above; as for the crawl depth, I left it unchanged and am relying on the filtering alone to do the work. However, I still seem to be running into the problem that the crawler only crawls the following URL: http://www.marinetraffic.com/en/ais/index/ports/all – Almanz
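One thing worth checking: with href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all"), only URLs under that exact listing path pass the filter, so links to the individual port pages, which probably live under a different path, are rejected and the crawl has nowhere to go after the seed. If the listing is rendered by JavaScript, crawler4j will not see those links at all, since it does not execute scripts. As a starting point, a looser filter that stays on the site but admits more port-related URLs might look like this; the path fragments are guesses, so verify the actual link targets in the page source first (same shouldVisit signature as in the code above):

public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();
    // Skip binary files, stay on marinetraffic.com, and accept any URL
    // whose path mentions ports (listing pages and detail pages alike).
    return !FILTERS.matcher(href).matches()
            && href.startsWith("http://www.marinetraffic.com/")
            && href.contains("/ports/");
}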