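Why does Nutch think it has already parsed all segments?

I'm using Nutch 1.6 to crawl some forums and Solr 1.6.2 to index them. I ran a test query against Solr and was surprised to get only a few results, so I suspected a problem either in how Nutch parses the pages or in how Solr indexes them. After poking around, I found that Nutch has not parsed many of the pages it fetched: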
bin/nutch readseg -list -dir crawl-mothering2/segments/
NAME            GENERATED  FETCHED  PARSED
20130228001531  23         27       9
20130228003940  1430       1434     661
20130228001829  202        206      105
20130228061337  1068       1090     475
20130228091009  1          2        0
20130228085956  34         34       25
20130228090348  44         45       34
20130228090851  7          7        6
20130228080438  364        374      192
20130228030933  1774       1795     903
20130228084205  168        169      63
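To put a number on the gap for each segment, here is a minimal sketch that reads the listing above from stdin and prints the parsed/fetched ratio. The function name `parse_ratio` is my own, and it assumes the last two columns of each row are FETCHED and PARSED, as in the output shown.

```shell
# parse_ratio: read `readseg -list` style rows from stdin and print, for each
# segment, "NAME parsed/fetched percent".
# Assumption: the last two fields of every data row are FETCHED and PARSED.
parse_ratio() {
  awk '
    # Skip the header: only rows whose first field is an all-digit segment name
    $1 ~ /^[0-9]+$/ && NF >= 4 {
      fetched = $(NF-1); parsed = $NF
      pct = (fetched > 0) ? 100 * parsed / fetched : 0
      printf "%s %d/%d %.1f%%\n", $1, parsed, fetched, pct
    }'
}
```

Piping the listing through it, for example `bin/nutch readseg -list -dir crawl-mothering2/segments/ | parse_ratio`, gives one line per segment; the first row above comes out as `20130228001531 9/27 33.3%`.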
But when I try to parse the segments, I get this:
bin/nutch parse crawl-mothering2/segments/*
ParseSegment: starting at 2013-03-21 00:20:43
ParseSegment: segment: crawl-mothering2/segments/20130228001531
Exception in thread "main" java.io.IOException: Segment already parsed!
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)
What gives?
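The exception is thrown from `ParseOutputFormat.checkOutputSpecs`. As far as I can tell from the Nutch 1.x source, that check fires when the segment already contains a `crawl_parse` subdirectory, regardless of how many records in it were parsed successfully. Treat that as an assumption to verify. Under it, a rough sketch for seeing which segments would trip the check looks like this (the helper name `segment_parse_state` is made up, and it assumes the segments live on the local filesystem rather than HDFS):

```shell
# segment_parse_state: for each segment directory under $1, report whether a
# crawl_parse subdirectory already exists.
# Assumption: the presence of crawl_parse is what "Segment already parsed!"
# keys on; confirm against the Nutch version in use.
segment_parse_state() {
  segments_dir=$1
  for seg in "$segments_dir"/*/; do
    [ -d "$seg" ] || continue            # glob matched nothing, or not a directory
    name=$(basename "$seg")
    if [ -d "${seg}crawl_parse" ]; then
      echo "$name has-crawl_parse"
    else
      echo "$name no-crawl_parse"
    fi
  done
}
```

Running `segment_parse_state crawl-mothering2/segments` prints one line per segment. It is also worth checking whether `bin/nutch parse` accepts more than one segment per invocation. If it takes a single segment argument, which I believe is the case but have not confirmed, then the `segments/*` glob only ever reaches the first directory.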