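Scrapy: crawling a local website by IP address

I'm still experimenting with Scrapy and am trying to crawl a website on my local network. The site's IP address is 192.168.0.185. This is my spider: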
    from scrapy.spider import BaseSpider

    class 192.168.0.185_Spider(BaseSpider):
        name = "192.168.0.185"
        allowed_domains = ["192.168.0.185"]
        start_urls = ["http://192.168.0.185/"]

        def parse(self, response):
            print "Test:", response.headers
Then, from the same directory as my spider, I run this shell command to start it:

    scrapy crawl 192.168.0.185

And I get a really ugly, unreadable error message:
    2012-02-10 20:55:18-0600 [scrapy] INFO: Scrapy 0.14.0 started (bot: tutorial)
    2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
    2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled item pipelines:
    Traceback (most recent call last):
      File "/usr/bin/scrapy", line 5, in <module>
        pkg_resources.run_script('Scrapy==0.14.0', 'scrapy')
      File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 467, in run_script
        self.require(requires)[0].run_script(script_name, ns)
      File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 1200, in run_script
        execfile(script_filename, namespace, namespace)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
        execute()
      File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py", line 132, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
        func(*a, **kw)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py", line 139, in _run_command
        cmd.run(args, opts)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/commands/crawl.py", line 43, in run
        spider = self.crawler.spiders.create(spname, **opts.spargs)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/spidermanager.py", line 43, in create
        raise KeyError("Spider not found: %s" % spider_name)
    KeyError: 'Spider not found: 192.168.0.185'
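So then I wrote another spider, which is essentially the same as the first one but uses a domain name instead of an IP address. This one worked just fine. Does anyone know what the deal is? How can I get Scrapy to crawl a website by IP address rather than by domain name?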
    from scrapy.spider import BaseSpider

    class facebook_Spider(BaseSpider):
        name = "facebook"
        allowed_domains = ["facebook.com"]
        start_urls = ["http://www.facebook.com/"]

        def parse(self, response):
            print "Test:", response.headers
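
One visible difference between the two spiders, apart from the address itself, is the class name. `192.168.0.185_Spider` starts with a digit and contains dots, so it is not a legal Python identifier, while `facebook_Spider` is. Whether that is what leads Scrapy to report "Spider not found" here is an assumption, not something the log confirms. The sketch below uses only the standard library (Python 3 syntax) to check the names; the helper functions and the name `LocalSiteSpider` are hypothetical and not part of Scrapy.

```python
import keyword


def is_valid_class_name(name):
    """Return True if `name` is a legal, non-keyword Python identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)


def class_definition_compiles(name):
    """Try to compile a minimal class statement that uses `name`."""
    source = "class {0}(object):\n    pass\n".format(name)
    try:
        compile(source, "<spider>", "exec")
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    # LocalSiteSpider is a hypothetical replacement name for illustration.
    for candidate in ("192.168.0.185_Spider", "facebook_Spider", "LocalSiteSpider"):
        print(candidate,
              is_valid_class_name(candidate),
              class_definition_compiles(candidate))
```

If that assumption holds, renaming only the class should be enough. The `name`, `allowed_domains`, and `start_urls` values are ordinary strings, so they can keep the IP address as written.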
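Well, I have to ask: why would you use IP addresses to identify hosts? They aren't as naturally descriptive as hostnames, so I'd suggest using them sparingly. – 2012-02-11 03:24:23
I'd suggest learning Python before using complex frameworks like Scrapy, Django, etc. You can pick a tutorial from the [Python wiki](http://wiki.python.org/moin/BeginnersGuide/Programmers) – reclosedev 2012-02-11 04:50:15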