Nutch 1.13 fetch of url failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http

fetch of httpurl failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:285)
Using queue mode : byHost
fetch of httpsurl failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:285)
I get the above result when running Nutch 1.13 with Solr 6.6.0. The command I use is:
bin/crawl -i -D solr.server.url=http://myip/solr/nutch/ urls/ crawl 2
Below is the plugin section of my nutch-site.xml:
<name>plugin.includes</name>
<value>
protocol-(http|httpclient)|urlfilter-regex|parse-(html)|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
</value>
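One thing worth checking in the fragment above is the whitespace inside <value>. The sketch below is only an illustration, under two assumptions that are not confirmed in this thread: that Nutch compiles the raw, untrimmed plugin.includes value as a Java regex, and that a plugin is loaded only when its id matches that regex in full. The class name PluginIncludesCheck and the method isIncluded are hypothetical helpers, not Nutch APIs. If the file really contains the line breaks shown, the first and last alternatives of the pattern would then carry a literal newline.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // The value from the question, with the line breaks that surround it
    // inside <value>...</value> (assumption: they exist in the real file).
    static final String RAW_VALUE = "\n"
        + "protocol-(http|httpclient)|urlfilter-regex|parse-(html)|index-(basic|anchor)"
        + "|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)"
        + "\n";

    // Hypothetical helper: full-match test of a plugin id against the regex.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {
            "protocol-http", "protocol-httpclient", "urlfilter-regex",
            "indexer-solr", "urlnormalizer-basic"
        };
        for (String id : ids) {
            // Compare the value as written with the same value after trim().
            System.out.printf("%-22s raw=%b trimmed=%b%n",
                id, isIncluded(RAW_VALUE, id), isIncluded(RAW_VALUE.trim(), id));
        }
    }
}
```

Under those assumptions, protocol-http and protocol-httpclient would match only the trimmed value, which would be consistent with the ProtocolNotFound error. Putting the whole value on the same line as the <value> tags is a cheap way to rule this out.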
Below are the contents of my plugins directory:
[[email protected] apache-nutch-1.13]# ls plugins
creativecommons index-more nutch-extensionpoints protocol-file scoring-similarity urlnormalizer-ajax
feed index-replace parse-ext protocol-ftp subcollection urlnormalizer-basic
headings index-static parsefilter-naivebayes protocol-htmlunit tld urlnormalizer-host
index-anchor language-identifier parsefilter-regex protocol-http urlfilter-automaton urlnormalizer-pass
index-basic lib-htmlunit parse-html protocol-httpclient urlfilter-domain urlnormalizer-protocol
indexer-cloudsearch lib-http parse-js protocol-interactiveselenium urlfilter-domainblacklist urlnormalizer-querystring
indexer-dummy lib-nekohtml parse-metatags protocol-selenium urlfilter-ignoreexempt urlnormalizer-regex
indexer-elastic lib-regex-filter parse-replace publish-rabbitmq urlfilter-prefix urlnormalizer-slash
indexer-solr lib-selenium parse-swf publish-rabitmq urlfilter-regex
index-geoip lib-xml parse-tika scoring-depth urlfilter-suffix
index-links microformats-reltag parse-zip scoring-link urlfilter-validator
index-metadata mimetype-filter plugin scoring-opic urlmeta
I am stuck on this problem. As you can see, I have included both protocols, protocol-(http|httpclient), but fetching the URL still fails. Thanks in advance.
Newer issue, from hadoop.log:
2017-09-01 14:35:07,172 INFO  solr.SolrIndexWriter - SolrIndexer: deleting 1/1 documents
2017-09-01 14:35:07,321 WARN  output.FileOutputCommitter - Output Path is null in cleanupJob()
2017-09-01 14:35:07,323 WARN  mapred.LocalJobRunner - job_local1176811933_0001
java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.http.util.Asserts.check(Asserts.java:34)
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:481)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:122)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2017-09-01 14:35:07,679 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:174)
    at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:208)
Did you try fetching it with only protocol-http? – Jorge
Yes. The fetch still fails. Do I need to include the plugin somewhere else? – SMJ
Can you paste the output of: bin/nutch parsechecker http://your_url – Jorge