2016-07-04 82 views
0

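Scraping data with Scrapy from HTML tags whose text is received via AJAX

I want to scrape the details of the mobile phones listed on 'amazon.in' from the following link: here, using Scrapy.

Here is my code: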

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Selector
from tars.items import ProductNameItem
import re as r

class Namespider(CrawlSpider):
    name = "flash"
    allowed_domains = ["amazon.in"]

    def __init__(self, *args, **kwargs):
        super(Namespider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@id="pagnNextLink"]')), callback="parse_start_url", follow=True),
    )

    def parse_start_url(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//li[@class="s-result-item celwidget "]')

        items = []
        for i in titles:
            item = ProductNameItem()

            # x-paths:
            name_xpath = "div[1]/div[3]/div[1]/a[1]/h2[1]/text()"
            url_xpath = "div[1]/div[3]/div[1]/a[1]/@href"
            price_xpath = "div[1]/div[5]/div[1]/a[1]/span[1]/text()"
            total_reviews_xpath = "div[1]/div[4]/a[1]/text()"

            # data-extraction:
            item["name"] = ' '.join(i.xpath(name_xpath).extract())
            item["url"] = ' '.join(i.xpath(url_xpath).extract())
            item["price"] = ' '.join(i.xpath(price_xpath).extract())
            item["total_reviews"] = ' '.join(i.xpath(total_reviews_xpath).extract())

            # append all data
            items.append(item)

        return items

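The code runs fine, but I get no data for the price and total_reviews fields. I have cross-checked the XPaths many times and they look correct, but on digging further I noticed something unusual about the 'a' and 'span' tags in those XPaths: the content of these tags seems to be loaded using AJAX or something similar. Could anyone offer some help on how to scrape data from such HTML tags?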

+0

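Your XPaths are basically asking to fail; use class names etc. to get what you are after. Also, how are we supposed to know what you are trying to get? The link you posted goes nowhere.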

+0

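My bad. The link works now. I also tried using classes, which gets me data down to the 'div' tags, but as soon as I step into the 'a' tag nothing is returned.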

+0

Which three things are you trying to get?

Answer

0

No AJAX calls are involved; your XPaths are simply wrong. The following CSS selector gets all the review counts:

In [13]: response.css("#container a[href*='#customerReviews']::text").extract() 
Out[13]: 
[u'9,812', 
u'32', 
u'32', 
u'17,301', 
u'1,408', 
u'99', 
u'9,816', 
u'9,808', 
u'17,298', 
u'91', 
u'91', 
u'8,351', 
u'9,585', 
u'9,808', 
u'10,223', 
u'174', 
u'809', 
u'671', 
u'5,215', 
u'5,215', 
u'1,776', 
u'462', 
u'671', 
u'1,147'] 

The name, price, link and number of reviews are all inside divs with the s-item-container class:

In [24]: divs = response.css("div.s-item-container") 

In [25]: for d in divs:
   ....:     anchor = d.css("a.a-link-normal.s-access-detail-page.a-text-normal")[0]
   ....:     name = anchor.xpath("./h2/@data-attribute").extract_first()
   ....:     reviews = d.css("a[href*='#customerReviews']::text").extract_first()
   ....:     a = d.css("a.a-link-normal.a-text-normal")[0]
   ....:     link = a.xpath("@href").extract_first()
   ....:     price = d.css("span.a-size-base.a-color-price.s-price.a-text-bold::text").extract_first()
   ....:     print(name, price, reviews, link)
   ....:
(u'Moto G Plus, 4th Gen (Black, 32 GB)', u'14,999.00', u'9,813', u'http://www.amazon.in/Moto-Plus-4th-Gen-Black/dp/B01DDP7GZK') 
(u'Moto G, 4th Gen (Black, 16GB)', u'12,499.00', u'32', u'http://www.amazon.in/Moto-4th-Gen-Black-16GB/dp/B01DDP7YI4') 
(u'Moto G, 4th Gen (White, 16GB)', u'12,499.00', u'32', u'http://www.amazon.in/Moto-4th-Gen-White-16GB/dp/B01DDP7GR8') 
(u'Lenovo Vibe K4 Note (Black, 16GB)', u'10,999.00', u'17,301', u'http://www.amazon.in/Lenovo-Vibe-K4-Note-Black/dp/B01A11D2U2') 
(u'OnePlus 3 (Graphite, 64GB)', u'27,999.00', u'1,408', u'http://www.amazon.in/OnePlus-3-Graphite-64GB/dp/B01DDP7UQ0') 
(u'Lenovo Vibe K5 (Gold, 16GB)', u'6,999.00', u'100', u'http://www.amazon.in/Lenovo-Vibe-K5-Gold-16GB/dp/B01DDP7UYC') 
(u'Moto G Plus, 4th Gen (White, 32 GB)', u'14,999.00', u'9,821', u'http://www.amazon.in/Moto-Plus-4th-Gen-White/dp/B01DDP85BY') 
(u'Moto G Plus, 4th Gen (Black, 16 GB)', u'13,499.00', u'9,819', u'http://www.amazon.in/Moto-Plus-4th-Gen-Black/dp/B01DDP87N0') 
(u'Lenovo Vibe K4 Note (White,16GB)', u'10,999.00', u'17,302', u'http://www.amazon.in/Lenovo-Vibe-K4-Note-White/dp/B01BHUN4S6') 
(u'Lenovo Vibe K5 (Silver, 16GB)', u'6,999.00', u'107', u'http://www.amazon.in/Lenovo-Vibe-K5-Silver-16GB/dp/B01DDP7D3A') 
(u'Xiaomi Redmi Note 3 (Silver, 32GB)', u'11,999.00', u'1,227', u'http://www.amazon.in/Xiaomi-Redmi-Note-Silver-32GB/dp/B01DK5K8WG') 
(u'Lenovo Vibe K5 (Grey, 16GB)', u'6,999.00', u'105', u'http://www.amazon.in/Lenovo-Vibe-K5-Grey-16GB/dp/B01DDP7MFE') 
(u'OnePlus X (Onyx, 16GB)', u'14,999.00', u'9,585', u'http://www.amazon.in/OnePlus-E1003-X-Onyx-16GB/dp/B016UPKCGU') 
(u'Moto G Plus, 4th Gen (White, 16 GB)', u'13,499.00', u'9,819', u'http://www.amazon.in/Moto-Plus-4th-Gen-White/dp/B01DDP85KU') 
(u'Coolpad Note 3 (Black, 16GB)', u'8,499.00', u'10,223', u'http://www.amazon.in/Coolpad-Note-3-Black-16GB/dp/B0158IT7ES') 
(u'Intex Aqua Speed HD (White-Champagne, 8GB)', u'4,190.00', u'174', u'http://www.amazon.in/Intex-Aqua-Speed-HD-White-Champagne/dp/B01FD7QTEK') 
(u'Asus Zenfone Max ZC550KL-6A068IN (Black, 2GB, 16GB)', u'8,999.00', u'810', u'http://www.amazon.in/Asus-Zenfone-ZC550KL-6A068IN-Black-16GB/dp/B018VKZPG4') 
(u'Coolpad Note 3 Plus (Champagne-White)', u'8,999.00', u'670', u'http://www.amazon.in/Coolpad-Note-3-Plus-Champagne-White/dp/B01DDP7V7S') 
(u'Redmi 2 (White)', u'5,999.00', u'5,215', u'http://www.amazon.in/Mi-Redmi-2-White/dp/B00VEB055E') 
(u'Lenovo Vibe X3 (White, 32GB)', u'19,999.00', u'1,776', u'http://www.amazon.in/Lenovo-Vibe-X3-White-32GB/dp/B01AY3H9QA') 
(u'XOLO Era X (Black)', u'10,000.00', u'462', u'http://www.amazon.in/XOLO-ERA-X-Era-Black/dp/B01BWL1A0O') 
(u'Coolpad Note 3 Plus (Gold)', u'8,999.00', u'671', u'http://www.amazon.in/Coolpad-Note-3-Plus-Gold/dp/B01DDP7DK8') 
(u'HTC Desire 620G (Santroni White)', u'8,699.00', u'1,147', u'http://www.amazon.in/HTC-Desire-620G-Santroni-White/dp/B00R7FPSDU') 
(u'Samsung Tizen Z3 (Silver)', u'5,590.00', u'9', u'http://www.amazon.in/Samsung-Tizen-Z3-Silver/dp/B01CXXJ8UY') 

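Always try to use class names, attributes etc. to find your content, and if you still want to test XPaths, do it in the scrapy shell, not in your browser.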

+0

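I actually verified the XPaths step by step, but still ended up with the wrong result. Lesson learned: from now on I will always use classes in my XPaths. Thanks a lot for your help and advice.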

+0

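@ShubhamPandey, even if XPaths like yours happen to work, they will break very easily: elements are far more likely to move around than ids, class names etc. are to change.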
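
To illustrate that last point, here is a small sketch of how a positional path breaks when a single element is added, while a class-based path keeps working. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy so that it runs without a crawl; the two markup snippets and the `price` helper are invented for the example, and the same idea should carry over to Scrapy selectors.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same result item: AFTER has one extra
# <div> inserted, as can happen when a site tweaks its layout.
BEFORE = """
<li>
  <div>
    <div><h2>Example Phone</h2></div>
    <div><a><span class="s-price">9,999.00</span></a></div>
  </div>
</li>
"""

AFTER = """
<li>
  <div>
    <div class="badge">Sponsored</div>
    <div><h2>Example Phone</h2></div>
    <div><a><span class="s-price">9,999.00</span></a></div>
  </div>
</li>
"""

POSITIONAL = "div[1]/div[2]/a[1]/span[1]"  # depends on exact nesting and order
BY_CLASS = ".//span[@class='s-price']"     # depends only on the class attribute


def price(markup, path):
    """Return the text of the first node matching path, or None if no match."""
    node = ET.fromstring(markup.strip()).find(path)
    return node.text if node is not None else None


results = {
    "before/positional": price(BEFORE, POSITIONAL),
    "after/positional": price(AFTER, POSITIONAL),
    "before/class": price(BEFORE, BY_CLASS),
    "after/class": price(AFTER, BY_CLASS),
}
print(results)
```

Note that ElementTree's `[@class='s-price']` compares the whole attribute value, whereas a CSS selector such as `span.s-price` in Scrapy matches one class among several, so the CSS form is usually the more forgiving of the two.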