2013-03-21

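Why does Nutch think it has already parsed all segments?

I am using Nutch 1.6 to crawl some forums and Solr 1.6.2 to index them. I ran a test query against Solr and was surprised to get only a few results. I suspected that something was going wrong either when Nutch parsed the pages or when Solr indexed them. After poking around, I found that Nutch has not parsed many of the pages it fetched: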

bin/nutch readseg -list -dir crawl-mothering2/segments/ 

NAME            GENERATED  FETCHED  PARSED
20130228001531         23       27       9
20130228003940       1430     1434     661
20130228001829        202      206     105
20130228061337       1068     1090     475
20130228091009          1        2       0
20130228085956         34       34      25
20130228090348         44       45      34
20130228090851          7        7       6
20130228080438        364      374     192
20130228030933       1774     1795     903
20130228084205        168      169      63
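
To put a number on the gap, the FETCHED and PARSED columns can be totalled with a small shell helper. This is only a sketch: `summarize_segments` is a hypothetical name, and it assumes the trimmed four-column layout pasted above (NAME, GENERATED, FETCHED, PARSED) rather than whatever additional columns `readseg -list` may print.

```shell
# Sum FETCHED and PARSED over the segment rows read from stdin.
# Assumption: rows have exactly the four columns shown in the table above,
# and the segment name is purely numeric (a timestamp).
summarize_segments() {
    awk '$1 ~ /^[0-9]+$/ && NF == 4 { fetched += $3; parsed += $4 }
         END {
             printf "fetched=%d parsed=%d unparsed=%d\n",
                    fetched, parsed, fetched - parsed
         }'
}
```

Fed the table above on stdin, this prints `fetched=5183 parsed=2473 unparsed=2710`, so a little under half of the fetched pages were parsed.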

But when I try to parse the segments, I get this:

bin/nutch parse crawl-mothering2/segments/* 
ParseSegment: starting at 2013-03-21 00:20:43 
ParseSegment: segment: crawl-mothering2/segments/20130228001531 
Exception in thread "main" java.io.IOException: Segment already parsed! 
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:416) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) 
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) 
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261) 
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209) 
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216) 

What gives?

Answers


If you want to re-parse, go into each segment directory under crawl/segments/ and remove the existing parse output:

rm -rf parse_text parse_data crawl_parse 

Then you can run:

bin/nutch parse crawldir/segments/<segmentnumber> 
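
To repeat this for every segment, the two steps can be wrapped in a loop. The following is a rough sketch, not a tested recipe: `reparse_segment`, `reparse_all`, and the `NUTCH_BIN` override are hypothetical names introduced here. It calls `parse` once per segment because, as far as I know, the command takes a single segment per invocation, which fits the log in the question where only the first segment of the glob is mentioned. Note that it deletes the old parse output, so back up the segments first if you need them.

```shell
# Hypothetical override so a different nutch launcher can be substituted.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

# Remove the previous parse output of one segment, then parse it again.
reparse_segment() {
    segment="$1"
    # These three directories are the ones named in the answer above.
    rm -rf "$segment/parse_text" "$segment/parse_data" "$segment/crawl_parse"
    "$NUTCH_BIN" parse "$segment"
}

# Re-parse every segment directory found under the given segments directory.
reparse_all() {
    segments_dir="$1"
    for segment in "$segments_dir"/*/; do
        [ -d "$segment" ] || continue      # skip if the glob matched nothing
        reparse_segment "${segment%/}"     # strip the trailing slash
    done
}
```

With the layout from the question this would be called as `reparse_all crawl-mothering2/segments`. If the PARSED counts stay the same afterwards, the pages are probably failing in the parser itself, and the parse logs are the next place to look.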