2017-01-03 80 views

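Scrapy reports "Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)"

I have just started learning Python and Scrapy. My first project is to scrape information from a website that publishes network security data. But when I run the spider from CMD, it says:

Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

and nothing comes out. I would be grateful if someone could help me solve this.

Here is my spider file: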

from ssl_abuse.items import SslAbuseItem 
import scrapy 

class SslAbuseSpider(scrapy.Spider):
    name = 'ssl_abuse'
    start_urls = ['https://sslbl.abuse.ch/']

    def parse(self, response):
        for sel in response.xpath('/table//tr'):
            item = SslAbuseItem()
            item['date'] = sel.xpath('/td/text()')[0].extract()
            item['name'] = sel.xpath('/td/text()')[2].extract()
            item['type'] = sel.xpath('/td/text()')[3].extract()
            yield item

This is the site I want to scrape:

https://sslbl.abuse.ch/

I want to get every column of the table on that page except the SHA1 fingerprint.


After I changed my code as Will suggested, an error appeared:

`2017-01-04 09:31:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-01-04 09:31:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-01-04 09:31:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sslbl.abuse.ch/robots.txt> (referer: None) 
2017-01-04 09:31:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sslbl.abuse.ch/> (referer: None) 
2017-01-04 09:31:53 [scrapy.core.scraper] ERROR: Spider error processing <GET https://sslbl.abuse.ch/> (referer: None) 
Traceback (most recent call last): 
    File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback 
    yield next(it) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output 
    for x in result: 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr> 
    return (_set_referer(r) for r in result or()) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "V:\work\ssl_abuse\ssl_abuse\spiders\ssl_abuse_spider.py", line 11, in parse 
    item['date']=sel.xpath('/td/text()')[0].extract() 
    File "c:\python27\lib\site-packages\parsel\selector.py", line 58, in __getitem__ 
    o = super(SelectorList, self).__getitem__(pos) 
IndexError: list index out of range` 

My updated code: `

from ssl_abuse.items import SslAbuseItem 
import scrapy 
class SslAbuseSpider(scrapy.Spider): 
    name='ssl_abuse' 
    start_urls=['https://sslbl.abuse.ch/'] 
    def parse(self, response): 
     for sel in response.xpath('//table//tr'): 
      item=SslAbuseItem() 
      item['date']=sel.xpath('/td/text()')[0].extract() 
      item['name']=sel.xpath('/td/text()')[2].extract() 
      item['type']=sel.xpath('/td/text()')[3].extract() 
      yield item` 

Answers


I ran a quick test with scrapy shell. The problem seems to be in the XPath expressions. The response.body looks like this:

... 
<table class="sortable"> 
<tr><th>Listing date (UTC)</th><th>SHA1 fingerprint</th><th>Common Name</th><th>Listing reason</th></tr> 
<tr bgcolor="#D8D8D8" onmouseover="this.style.backgroundColor='#3371A3';" onmouseout="this.style.backgroundColor='#D8D8D8';"><td>2016-12-30 07:54:19</td><td><a href="/intel/1d05c6fef14d2671d759a05b496464b831c650e8" target="_parent" title="Show more information about this SSL certificate">1d05c6fef14d2671d759a05b496464b831c650e8</a></td><td>host/[email protected]</td><td>Gootkit C&amp;C</td></tr> 
<tr bgcolor="#ffffff" onmouseover="this.style.backgroundColor='#3371A3';" onmouseout="this.style.backgroundColor='#ffffff';"><td>2016-12-28 10:03:54</td><td><a href="/intel/a82dd258544acf0a109296493421262397741db7" target="_parent" title="Show more information about this SSL certificate">a82dd258544acf0a109296493421262397741db7</a></td><td>google.com/[email protected]</td><td>Gootkit C&amp;C</td></tr> 
<tr bgcolor="#D8D8D8" onmouseover="this.style.backgroundColor='#3371A3';" onmouseout="this.style.backgroundColor='#D8D8D8';"><td>2016-12-27 19:19:35</td><td><a href="/intel/df6f665e91d2fe8a338f778ad53c1921fcab3d8f" target="_parent" title="Show more information about this SSL certificate">df6f665e91d2fe8a338f778ad53c1921fcab3d8f</a></td><td>CN=p.fmsacademy.it</td><td>Gozi MITM</td></tr> 
... 

The first row is the table header; the actual content starts at the second row. For example:

# scrapy shell 'https://sslbl.abuse.ch/' 
>>> rows = response.xpath('//table//tr') 
>>> head = rows[0] 

>>> head.xpath('th/text()').extract() 
[u'Listing date (UTC)', u'SHA1 fingerprint', u'Common Name', u'Listing reason'] 

>>> td1 = rows[1] 
>>> td1.xpath('td') 
[<Selector xpath='td' data=u'<td>2016-12-30 07:54:19</td>'>, <Selector xpath='td' data=u'<td><a href="/intel/1d05c6fef14d2671d759'>, <Selector xpath='td' data=u'<td>host/[email protected]</td>'>, <Selector xpath='td' data=u'<td>Gootkit C&amp;C</td>'>] 

>>> td1.xpath('td/text()').extract() 
[u'2016-12-30 07:54:19', u'host/[email protected]', u'Gootkit C&C'] 

So the XPath to locate the tr rows should be:

for sel in response.xpath('//table//tr'): 

And the XPath to locate the td text is:

item['date']=sel.xpath('td/text()')[0].extract() 
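
Two details in the shell session above still matter once the path is relative. The header row has no td cells, so indexing [0] on it would raise the same IndexError, and 'td/text()' returned only three strings for a four-cell row because the SHA1 fingerprint sits inside an <a> element, so [3] would also be out of range. A minimal sketch of indexing cells instead of text nodes follows. It uses the standard library's xml.etree.ElementTree rather than Scrapy selectors so that it runs on its own, and the names TABLE and extract_rows are made up for illustration; the markup is a trimmed copy of the rows shown above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table shown above: the header row plus one data row.
TABLE = """
<table class="sortable">
<tr><th>Listing date (UTC)</th><th>SHA1 fingerprint</th><th>Common Name</th><th>Listing reason</th></tr>
<tr><td>2016-12-27 19:19:35</td><td><a href="/intel/df6f665e91d2fe8a338f778ad53c1921fcab3d8f">df6f665e91d2fe8a338f778ad53c1921fcab3d8f</a></td><td>CN=p.fmsacademy.it</td><td>Gozi MITM</td></tr>
</table>
"""


def extract_rows(table_markup):
    """Return one dict per data row, skipping the header and the SHA1 column."""
    table = ET.fromstring(table_markup.strip())
    rows = []
    for tr in table.iter('tr'):
        cells = tr.findall('td')  # relative lookup: direct <td> children only
        if len(cells) < 4:
            continue  # the header row has <th> cells and no <td>
        rows.append({
            'date': cells[0].text,
            'name': cells[2].text,  # cells[1] holds the SHA1 fingerprint
            'type': cells[3].text,
        })
    return rows


rows = extract_rows(TABLE)
print(rows)
```

Inside the spider, the equivalent would presumably be to take cells = sel.xpath('td'), skip the row when it has fewer than four cells, and read each value from its own cell; that variant is not tested here.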
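
I updated my code, but an error came up ... –

Could you change the path to 'td/text()', removing the leading '/'? The path you specified, '/td/text()', does not match any elements. That is why you got the "index out of range" error when you tried to get the first item. item['date']=sel.xpath('/td/text()')[0].extract() – Will

I just noticed that the XPath for the td text in my answer was wrong. I have now removed the leading '/'. – Will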