Apache Nutch 1.12 does not crawl any URLs other than the one specified in seed.txt

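I am using Apache Nutch 1.12. The URL I am trying to crawl looks like https://www.mywebsite.com/abc-def/ and it is the only entry in my seed.txt file. Because I don't want any page crawled whose URL does not contain "abc-def", I put the following line in regex-urlfilter.txt: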

+^https://www.mywebsite.com/abc-def/(.+)*$
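As a quick sanity check of the filter line above, here is a minimal sketch in Python that tests which URLs the pattern accepts. It is only an approximation: Nutch evaluates these rules with Java regular expressions, and the candidate URLs and the `is_accepted` helper are made up for illustration. For a pattern this simple, Python's `re` module should behave the same way.

```python
import re

# Same pattern as the urlfilter line, without the leading "+"
# (the "+" only marks the rule as an accept rule).
ACCEPT = re.compile(r"^https://www.mywebsite.com/abc-def/(.+)*$")


def is_accepted(url: str) -> bool:
    """Hypothetical helper: True if the accept pattern matches the URL."""
    return ACCEPT.search(url) is not None


# Illustrative URLs only; none of these come from the real site.
candidates = [
    "https://www.mywebsite.com/abc-def/",
    "https://www.mywebsite.com/abc-def/page1",
    "https://www.mywebsite.com/abc-def",
    "https://www.mywebsite.com/other/",
    "https://wwwXmywebsite.com/abc-def/",  # unescaped "." matches any character
]

for url in candidates:
    print(url, is_accepted(url))
```

If this sketch is representative, the pattern already accepts the seed URL and anything below it, so the filter alone would not explain why no further URLs are selected. Note that the unescaped dots also accept look-alike hosts; writing `www\.mywebsite\.com` would be stricter.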

When I run the following crawl command:

**/bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3**

it fetches and indexes only the single URL from seed.txt, and then in the second iteration it just says:

Generator: starting at 2017-02-28 09:51:36 

Generator: Selecting best-scoring urls due for fetch. 

Generator: filtering: false 

Generator: normalizing: true 

Generator: topN: 50000 

Generator: 0 records selected for fetching, exiting ... 

Generate returned 1 (no new segments created) 

Escaping loop: no more URLs to fetch now

When I change the rule in regex-urlfilter.txt to allow everything (+.), it starts indexing every URL on https://www.mywebsite.com, which of course I don't want.

If anyone has run into the same problem, please share how you got past it.


2017-02-27 Torukmakto

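I figured it out after many attempts over the past 2 days. Here is the solution that worked:

The website I am crawling is very heavy, and the content limit property in nutch-default.xml was truncating each page to 65536 bytes (the default). Unfortunately, the links I wanted to crawl were not inside the part that was kept, so Nutch never saw them and did not crawl them. When I changed the limit to unlimited by adding the following to nutch-site.xml, it started crawling my pages: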

<property> 
    <name>http.content.limit</name> 
    <value>-1</value> 
    <description>The length limit for downloaded content using the http:// 
    protocol, in bytes. If this value is nonnegative (>=0), content longer 
    than it will be truncated; otherwise, no truncation at all. Do not 
    confuse this setting with the file.content.limit setting. 
    </description> 
</property>
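To illustrate why truncation hides links, here is a small Python sketch. It does not use Nutch itself: the page content, the `LinkCollector` class and the `extract_links` helper are hypothetical, and only the 65536-byte figure and the "negative means no truncation" rule come from the property description above.

```python
from html.parser import HTMLParser

CONTENT_LIMIT = 65536  # default limit cited in the answer above


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (illustrative, not Nutch's parser)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(raw: bytes, limit: int = -1) -> list:
    """Parse links from raw bytes, truncating first when limit >= 0."""
    if limit >= 0:
        raw = raw[:limit]
    parser = LinkCollector()
    parser.feed(raw.decode("utf-8", errors="ignore"))
    return parser.links


# Hypothetical heavy page: about 100 KB of filler before the first link.
filler = "<p>" + "x" * 100_000 + "</p>"
page = (
    "<html><body>" + filler
    + '<a href="https://www.mywebsite.com/abc-def/page1">page1</a>'
    + "</body></html>"
).encode("utf-8")

truncated_links = extract_links(page, CONTENT_LIMIT)  # link lies past the cut
full_links = extract_links(page, -1)                  # no truncation

print("with limit:   ", truncated_links)
print("without limit:", full_links)
```

In this sketch the truncated copy yields no outlinks, which matches the "0 records selected for fetching" symptom in the question. Removing the limit entirely may use more memory and bandwidth on very large pages, so a larger finite value could also be worth trying.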


2017-03-03 05:23:40 Torukmakto

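You can try adjusting the properties available in conf/nutch-default.xml. Perhaps limit the number of outlinks you want, or modify the fetch properties. If you decide to override a property, copy it into conf/nutch-site.xml and set the new value there.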


2017-02-28 18:41:27

Could you be more specific about which properties I should adjust to make this work? I have tried a few by copying them into nutch-site.xml, but it did not help. – Torukmakto

So basically you don't want to crawl any external links from https://www.mywebsite.com/abc-def/, right? If so, try setting 'db.ignore.external.links' to 'true'. Let me know and I can edit the answer accordingly. –

No, I want the internal links whose path starts with mywebsite.com/abc-def/. Anyway, I think I have solved the problem. Thanks for your help. – Torukmakto
