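Scrapy: find all links with different (similar) classes

I am trying to scrape links from elements whose classes look like "post-item post-item-xxxxx". Since the class is different for every element, how can I catch them all?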

<li class="post-item post-item-18887"><a href="http://example.com/archives/18887.html" title="Post1"></a></li>
<li class="post-item post-item-18883"><a href="http://example.com/archives/18883.html" title="Post2"></a></li>

My code:

# Scrape all the cafes from example.com

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class DengaSpider(scrapy.Spider):
    name = 'cafes'
    allowed_domains = ['example.com']
    start_urls = [
        'http://example.com/archives/8136.html',
    ]

    rules = [
        Rule(
            LinkExtractor(
                allow=('^http://example\.com/archives/\d+.html$'),
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    def parse(self, response):
        cafelink = response.css('post.item').xpath('//a/@href').extract()
        if cafelink is not None:
            print(cafelink)

The CSS part that selects the links is not working. How can I fix it?

2017-05-08 DatCra

XPath has a contains() function for this, so you can try:

cafelink = response.xpath("//*[contains(@class, 'post-item-')]//a/@href").extract()

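Also be careful when using // in XPath: it makes the search start at the document root, no matter where the current node is. Use .// to search relative to the current node.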

2017-05-08 11:49:54 rrschmidt

//* gives me a syntax error. I tried a few different ways and still get the same error. – DatCra

Sorry, I forgot the quotes. I edited my answer to fix it. – rrschmidt

Here is a scrapy shell session run against the HTML sample above:

>>> from scrapy.http import HtmlResponse 
>>> response = HtmlResponse(url="Test HTML String", body='<li class="post-item post-item-18887"><a href="http://example.com/archives/18887.html" title="Post1"></a></li><li class="post-item post-item-18883"><a href="http://example.com/archives/18883.html" title="Post2"></a></li>', encoding='utf-8')
>>> 
>>> cafelink = response.css('li.post-item a::attr(href)').extract_first() 
>>> cafelink 
'http://example.com/archives/18887.html' 
>>> 
>>> cafelink = response.css('li.post-item a::attr(href)').extract() 
>>> cafelink 
['http://example.com/archives/18887.html', 'http://example.com/archives/18883.html']

2017-05-08 11:56:53 JkShaw

It works! But 'yield' only returns the first result, while 'print' shows the whole list. Any idea why? – DatCra

'yield' has nothing to do with getting only the first result. If you call extract_first(), only the first result is extracted; if you call extract(), all results are extracted. – JkShaw

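If the items you want all have the "post-item" class, why do you need their other class to catch them? If you still need to do that, try the "starts with" CSS attribute selector: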

response.css('li[class^="post-item post-item-"]')

See the CSS attribute selectors documentation.

2017-05-08 22:28:16 lufte

Thanks. I should have used CSS instead of XPath from the start. – DatCra
