2016-01-21 82 views

TypeError: cannot create weak reference to 'str' object in Python Scrapy

I have written the following spider in Python using Scrapy:

#!/usr/bin/python 
from twisted.internet import reactor 
import scrapy 
from scrapy.crawler import CrawlerRunner 
from scrapy.utils.log import configure_logging 
from scrapy.selector import Selector 

class GivenSpider(scrapy.Spider): 
    name = "dmoz" 
    allowed_domains = ["dmoz.org"] 
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        select = Selector(response.body)
        title = select.xpath("//a[@class=listinglink]/@href").extract()
        print title
#        for t in title:
#            title4 = MyItem()
#            title4['content'] = t
#            yield title4

#        filename = response.url.split("/")[-2] + '.html'
#        with open(filename, 'wb') as f:
#            f.write(response.body)

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'}) 
runner = CrawlerRunner() 

d = runner.crawl(GivenSpider) 
d.addBoth(lambda _: reactor.stop()) 
reactor.run() 

I run it with:

$ python runTimeSpider.py 

and get the following output:

INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
INFO: Enabled item pipelines: 
INFO: Spider opened 
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
DEBUG: Telnet console listening on 127.0.0.1:6023 
DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None) 
DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None) 
ERROR: Spider error processing <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None) 
Traceback (most recent call last): 
    File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "runTimeSpider.py", line 17, in parse 
    select = Selector(str(response.body)) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 80, in __init__ 
    _root = LxmlDocument(response, self._parser) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 24, in __new__ 
    cache = cls.cache.setdefault(response, {}) 
    File "/usr/lib/python2.7/weakref.py", line 433, in setdefault 
    return self.data.setdefault(ref(key, self._remove),default) 
TypeError: cannot create weak reference to 'str' object 
ERROR: Spider error processing <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None) 
Traceback (most recent call last): 
    File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "runTimeSpider.py", line 17, in parse 
    select = Selector(str(response.body)) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 80, in __init__ 
    _root = LxmlDocument(response, self._parser) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 24, in __new__ 
    cache = cls.cache.setdefault(response, {}) 
    File "/usr/lib/python2.7/weakref.py", line 433, in setdefault 
    return self.data.setdefault(ref(key, self._remove),default) 
TypeError: cannot create weak reference to 'str' object 
INFO: Closing spider (finished) 
INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 514, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 16284, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 1, 21, 8, 28, 26, 17960), 
'log_count/DEBUG': 3, 
'log_count/ERROR': 2, 
'log_count/INFO': 7, 
'response_received_count': 2, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'spider_exceptions/TypeError': 2, 
'start_time': datetime.datetime(2016, 1, 21, 8, 28, 24, 986319)} 
INFO: Spider closed (finished) 

How can I print the titles? It fails with this error:

TypeError: cannot create weak reference to 'str' object 

Answer


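The reason is that you are passing response.body to Selector. response.body is a string, not a response object, and you cannot run XPath queries on a plain string this way. As the traceback shows, Selector tries to cache the parsed document under a weak reference to its argument, and a str object cannot be weakly referenced.

So either use

select = Selector(response) 

or call the XPath query directly on the response object, since it provides an xpath method itself:

title = response.xpath("//a[@class='listinglink']/@href").extract() 

Note that the attribute value in the XPath has to be quoted (@class='listinglink'). Without the quotes, listinglink is read as the name of a child element rather than as a string.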
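
To see where the TypeError comes from without running a crawl, here is a minimal sketch that uses only the standard library. It imitates the weak-keyed cache visible in the traceback (cls.cache.setdefault(response, {})). FakeResponse and try_cache are hypothetical names made up for this illustration; they are not part of Scrapy.

```python
import weakref


class FakeResponse(object):
    """Hypothetical stand-in for a Scrapy response; not a Scrapy class."""

    def __init__(self, body):
        self.body = body


def try_cache(key):
    """Use `key` in a weak-keyed cache; return the error message, or None on success."""
    cache = weakref.WeakKeyDictionary()
    try:
        cache.setdefault(key, {})
    except TypeError as exc:
        return str(exc)
    return None


response = FakeResponse("<html></html>")

# An instance of an ordinary class can be weakly referenced, so this succeeds.
object_error = try_cache(response)

# A plain str cannot be weakly referenced, so this raises TypeError internally.
string_error = try_cache(response.body)

print(object_error)
print(string_error)
```

The second call fails for the same reason as the spider above: the cache key is a string instead of an object that supports weak references.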

Thank you very much – MLSC