Why does restrict_xpaths ignore the hrefs inside <a> tags?

I am scraping a Wikipedia page to extract all of its image URLs; my code is below. Why does restrict_xpaths ignore the hrefs inside <a> tags?

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class WikiSpider(CrawlSpider):
    name = 'wiki'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Katy_Perry']

    rules = [Rule(LinkExtractor(restrict_xpaths=('//a[@class="image"]')),
                  callback='parse_item', follow=False),]

    def parse_item(self, response):
        print(response.url)

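When I run the spider it shows no results, but when I change the XPath inside restrict_xpaths it prints some random links. I need the hrefs matched by the XPath '//a[@class="image"]', but it does not work. What is the reason? I know I could use a basic Spider instead of CrawlSpider and avoid rules entirely, but I want to understand why the XPath I entered does not work, and which kinds of XPath expressions and HTML tags restrict_xpaths accepts.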


2016-08-18 Uchiha Madara

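You would usually pass something like restrict_xpaths='//td[@colspan="2"]', where the anchors sit inside the selected element.

But I have passed <a> tags as XPaths before and they all worked fine. For example, on [reddit.com/r/pics](https://www.reddit.com/r/pics), to crawl the href inside the next-page button I used '//a[@rel="nofollow next"]' and it worked: it crawled the next page. I don't understand why a similar XPath, '//a[@class="image"]', does not work in this example.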

The links you want point to image files:

$ scrapy shell "https://en.wikipedia.org/wiki/Katy_Perry" -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36' 
2016-08-19 11:17:05 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot) 
(...) 
2016-08-19 11:17:06 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Katy_Perry> (referer: None) 
(...) 
In [1]: response.xpath('//a[@class="image"]/@href').extract() 
Out[1]: 
['/wiki/File:Katy_Perry_DNC_July_2016_(cropped).jpg', 
'/wiki/File:Katy_Perry_performing.jpg', 
'/wiki/File:Katy_Perry%E2%80%93Zenith_Paris.jpg', 
'/wiki/File:PWT_Cropped.jpg', 
'/wiki/File:Alanis_Morissette_5-19-2014.jpg', 
'/wiki/File:Freddie_Mercury_performing_in_New_Haven,_CT,_November_1977.jpg', 
'/wiki/File:Katy_Perry_California_Dreams_Tour_01.jpg', 
'/wiki/File:Katy_Perry_UNICEF_2012.jpg', 
'/wiki/File:Katy_Perry_Hillary_Clinton,_I%27m_With_Her_Concert.jpg', 
'/wiki/File:Wikiquote-logo.svg', 
'/wiki/File:Commons-logo.svg']

By default, link extractors filter out a lot of file extensions, including images:

In [2]: from scrapy.linkextractors import LinkExtractor 

In [3]: LinkExtractor(restrict_xpaths=('//a[@class="image"]')).extract_links(response) 
Out[3]: []

You can use deny_extensions=[] to not filter anything:

In [4]: LinkExtractor(restrict_xpaths=('//a[@class="image"]'), deny_extensions=[]).extract_links(response) 
Out[4]: 
[Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_DNC_July_2016_(cropped).jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_performing.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry%E2%80%93Zenith_Paris.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:PWT_Cropped.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Alanis_Morissette_5-19-2014.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Freddie_Mercury_performing_in_New_Haven,_CT,_November_1977.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_California_Dreams_Tour_01.jpg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Katy_Perry_UNICEF_2012.jpg', text='', fragment='', nofollow=False), 
Link(url="https://en.wikipedia.org/wiki/File:Katy_Perry_Hillary_Clinton,_I'm_With_Her_Concert.jpg", text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Wikiquote-logo.svg', text='', fragment='', nofollow=False), 
Link(url='https://en.wikipedia.org/wiki/File:Commons-logo.svg', text='', fragment='', nofollow=False)]


2016-08-19 09:21:24

Thank you very much. I almost never use a debugger, and now I know how to use it properly.

The Scrapy shell isn't really a debugger, but it is one of your best friends in a Scrapy project.

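Out of curiosity, I tried to get the image links from the 'src' attribute of the 'img' tags, like this: LinkExtractor(restrict_xpaths=('//img'), deny_extensions=[]).extract_links(response). Then I set the 'attrs' parameter to 'src', as in LinkExtractor(restrict_xpaths=('//img'), deny_extensions=[], attrs=('src')).extract_links(response), and it returned an empty list. I know this is all unnecessary, but doesn't restrict_xpaths accept any tag, or am I doing something wrong?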
