Scrapy: How to select the first a tag inside a div element using XPath

I am using Scrapy's SitemapSpider to extract all product links from their respective collections. All the websites on my list are Shopify stores, and the markup that links to a product looks like this:

<div class="grid__item grid-product medium--one-half large--one-third">
    <div class="grid-product__wrapper">
        <div class="grid-product__image-wrapper">
            <a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
                <img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
            </a>
        </div>
        <a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
            <span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
            <span class="grid-product__price-wrap">
                <span class="long-dash">—</span>
                <span class="grid-product__price">
                    $ 15
                </span>
            </span>
        </a>
    </div>
</div>

Clearly, both hrefs are exactly the same. My problem is that the following code scrapes both links:

product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()

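I want to select the div element that has both a tags as descendants. From there, I only want to extract the href from the first a tag, to avoid duplicate links.

Although every site is a Shopify store, the source code of their collection pages is not exactly the same. The depth of the a tags under the div element is therefore inconsistent, and I can't add a predicate like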

//div[@class="grid__item grid-product medium--one-half large--one-third"]

Source

2017-09-16 barnesc

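I made a brand-new post that clarifies what I'm asking and gives a much more simplified version of the code. In case someone else has a similar problem, I'd rather not delete this question. [Link to the improved question](https://stackoverflow.com/questions/46258500/xpath-and-scrapy-scraping-links-when-the-depth-and-quantity-of-a-tags-are-inco) – barnesc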

product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract() 
print(product_links[0]) # This is your first a Tag

Source

2017-09-16 00:20:20 Serjik

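This doesn't work. When using Scrapy's SitemapSpider, response.xpath() returns a list for each page it scrapes from the sitemap. print(product_links[0]) returns the first item of each list, so I would only scrape the first product from each collection – barnesc

Just use extract_first() to extract only the first matching element. The benefit of using it is that it avoids an IndexError and returns None when no element matches the selection.

So, it should be: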

>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first() 
u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'

Source

2017-09-17 12:02:30
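
Since both a tags of a product carry the same href, one approach that does not depend on how deeply the tags are nested is to extract every matching href and drop the duplicates afterwards. This is a different technique from the XPath-only selection asked about, shown here as a rough sketch: unique_links is a hypothetical helper, and the scraped list stands in for what .extract() might return on a collection page with two products.

```python
def unique_links(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(hrefs))


# Hypothetical .extract() output: each product is linked twice,
# once from its image and once from its title.
scraped = [
    "/collections/accessories/products/bracelet",
    "/collections/accessories/products/bracelet",
    "/collections/accessories/products/ring",
    "/collections/accessories/products/ring",
]

product_links = unique_links(scraped)
print(product_links)
```

In a spider, the same call could wrap the list returned by response.xpath(...).extract() before the links are followed.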
