2016-08-18

I am scraping a Wikipedia page to extract all image URLs; here is the code. Why does restrict_xpaths ignore the hrefs inside the <a> tags?

from scrapy.linkextractors import LinkExtractor 
from scrapy.spiders import CrawlSpider, Rule 

class WikiSpider(CrawlSpider): 
    name = 'wiki' 
    allowed_domains = ['en.wikipedia.org'] 
    start_urls = ['https://en.wikipedia.org/wiki/Katy_Perry'] 

    rules = [Rule(LinkExtractor(restrict_xpaths=('//a[@class="image"]')),
                  callback='parse_item', follow=False)]

    def parse_item(self, response):
        print(response.url)

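When I run the spider, it shows no results, but when I change the XPath inside restrict_xpaths, it prints some random links. I need the hrefs matched by the XPath '//a[@class="image"]', but it does not work. What is the reason? I know I could use a basic Spider instead of CrawlSpider and avoid rules entirely, but I would like to know why the XPath I entered does not work, and which kinds of XPath expressions and HTML tags restrict_xpaths accepts.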

Answer

The links you want point to images:

$ scrapy shell "https://en.wikipedia.org/wiki/Katy_Perry" -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36' 
2016-08-19 11:17:05 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot) 
(...) 
2016-08-19 11:17:06 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Katy_Perry> (referer: None) 
(...) 
In [1]: response.xpath('//a[@class="image"]/@href').extract() 
Out[1]: 
['/wiki/File:Katy_Perry_DNC_July_2016_(cropped).jpg', 
'/wiki/File:Katy_Perry_performing.jpg', 
'/wiki/File:Katy_Perry%E2%80%93Zenith_Paris.jpg', 
'/wiki/File:PWT_Cropped.jpg', 
'/wiki/File:Alanis_Morissette_5-19-2014.jpg', 
'/wiki/File:Freddie_Mercury_performing_in_New_Haven,_CT,_November_1977.jpg', 
'/wiki/File:Katy_Perry_California_Dreams_Tour_01.jpg', 
'/wiki/File:Katy_Perry_UNICEF_2012.jpg', 
'/wiki/File:Katy_Perry_Hillary_Clinton,_I%27m_With_Her_Concert.jpg', 
'/wiki/File:Wikiquote-logo.svg', 
'/wiki/File:Commons-logo.svg'] 

and by default, link extractors filter out a lot of extensions, including images:

In [2]: from scrapy.linkextractors import LinkExtractor 

In [3]: LinkExtractor(restrict_xpaths=('//a[@class="image"]')).extract_links(response) 
Out[3]: [] 
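
To make the effect of that filter concrete, here is a minimal standard-library sketch of an extension check. It is not Scrapy's actual implementation: the function name `is_filtered` and the set `ILLUSTRATIVE_DENIED` are hypothetical, and the set is only a small illustrative subset of the extensions Scrapy ignores by default.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, illustrative subset; Scrapy's real default list is much longer.
ILLUSTRATIVE_DENIED = {"jpg", "jpeg", "png", "gif", "svg", "pdf", "zip"}


def is_filtered(url, deny_extensions=ILLUSTRATIVE_DENIED):
    """Return True if the URL path ends in one of the denied extensions."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower().lstrip(".")
    return ext in set(deny_extensions)


image_page = "https://en.wikipedia.org/wiki/File:PWT_Cropped.jpg"
article = "https://en.wikipedia.org/wiki/Katy_Perry"

print(is_filtered(image_page))                      # True
print(is_filtered(article))                         # False
print(is_filtered(image_page, deny_extensions=[]))  # False
```

The Wikipedia "File:" pages are HTML pages, but their URLs end in .jpg or .svg, so an extension-based filter of this kind treats them as images and drops them.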

You can use deny_extensions=[] to not filter anything:

In [4]: LinkExtractor(restrict_xpaths=('//a[@class="image"]'), deny_extensions=[]).extract_links(response) 
Out[4]: 
[Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_DNC_July_2016_(cropped).jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_performing.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry%E2%80%93Zenith_Paris.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:PWT_Cropped.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Alanis_Morissette_5-19-2014.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Freddie_Mercury_performing_in_New_Haven,_CT,_November_1977.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_California_Dreams_Tour_01.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_UNICEF_2012.jpg', text='', fragment='', nofollow=False), 
Link(url="https://en.wikipedia.org/wiki/File:Katy_Perry_Hillary_Clinton,_I'm_With_Her_Concert.jpg", text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Wikiquote-logo.svg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Commons-logo.svg', text='', fragment='', nofollow=False)] 

Thank you very much. I almost never use a debugger; now I know how to use one properly.

The Scrapy shell is not really a debugger, but it is one of your best friends in a Scrapy project.

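Out of curiosity, I tried to get the image links from the `src` attribute of the `img` tags, like this: `LinkExtractor(restrict_xpaths=('//img'), deny_extensions=[]).extract_links(response)`. Then I set the `attrs` parameter to `src`, as in `LinkExtractor(restrict_xpaths=('//img'), deny_extensions=[], attrs=('src')).extract_links(response)`, and it still returned an empty list. I know this is all unnecessary, but doesn't `restrict_xpaths` accept any tag, or am I doing something wrong?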