
Since I'm new to Scrapy, I don't know where the problem is; it may well be something easy to fix. I'm hoping to find a solution. Thanks in advance. Why doesn't my Scrapy spider scrape anything?

I'm using Ubuntu 14.04 and Python 3.4.

My spider:

```python
class EnActressSpider(scrapy.Spider):
    name = "en_name"
    allowed_domains = ["www.r18.com/", "r18.com/"]
    start_urls = ["http://www.r18.com/videos/vod/movies/actress/letter=a/sort=popular/page=1"]

    def parse(self, response):
        for sel in response.xpath('//*[@id="contents"]/div[2]/section/div[3]/ul/li'):
            item = En_Actress()
            item['image_urls'] = sel.xpath('a/p/img/@src').extract()
            name_link = sel.xpath('a/@href').extract()
            request = scrapy.Request(name_link, callback=self.parse_item, dont_filter=True)
            request.meta['item'] = item
            yield request

        next_page = response.css("#contents > div.main > section > div.cmn-sec-item01.pb00 > div > ol > li.next > a::attr('href')")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse, dont_filter=True)

    def parse_item(self, response):
        item = reponse.meta['item']
        name = response.xpath('//*[@id="contents"]/div[1]/ul/li[5]/span/text()')
        item['name'] = name[0].encode('utf-8')
        yield item
```

LOG:

```
{'downloader/request_bytes': 988,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 48547,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 7, 25, 6, 46, 36, 940936),
 'log_count/DEBUG': 1,
 'log_count/INFO': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2016, 7, 25, 6, 46, 35, 908281)}
```

Any help would be greatly appreciated.


Can you provide a link to the site you're scraping, or rather, what URL does the `parse()` method receive? Or just post the entire contents of the spider file. – Granitosaurus


[Link](http://www.r18.com/videos/vod/movies/actress/letter=a/sort=popular/page=1). Also, I've edited my question. Thanks, Granitosaurus. – Jin

Answer


There seem to be a few syntax errors. I've cleaned the spider up, and it appears to work fine here. I also removed the `dont_filter` argument from the `Request` objects, since you don't want to scrape duplicates, and adjusted `allowed_domains`, which was filtering out some of the content. In the future, you should post the whole log.

```python
import scrapy


class EnActressSpider(scrapy.Spider):
    name = "en_name"
    allowed_domains = ["r18.com"]
    start_urls = ["http://www.r18.com/videos/vod/movies/actress/letter=a/sort=popular/page=1"]

    def parse(self, response):
        for sel in response.xpath('//*[@id="contents"]/div[2]/section/div[3]/ul/li'):
            item = dict()
            item['image_urls'] = sel.xpath('a/p/img/@src').extract()
            name_link = sel.xpath('a/@href').extract_first()
            request = scrapy.Request(name_link, callback=self.parse_item)
            request.meta['item'] = item
            yield request

        next_page = response.css(
            "#contents > div.main > section > div.cmn-sec-item01.pb00 > "
            "div > ol > li.next > a::attr('href')").extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, self.parse)

    def parse_item(self, response):
        item = response.meta['item']
        name = response.xpath('//*[@id="contents"]/div[1]/ul/li[5]/span/text()').extract_first()
        item['name'] = name.encode('utf-8')
        yield item
```
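
For context, the `'spider_exceptions/TypeError': 1` entry in your stats almost certainly comes from passing a list to `scrapy.Request`: `.extract()` returns a list of strings, while `Request` expects a single URL string, which is why `name_link` needs `.extract_first()`. (The original `parse_item` also referred to `reponse` instead of `response`, which would have raised a `NameError` once a detail page was reached.) A minimal sketch of the difference, using throwaway HTML rather than the real page:

```python
from scrapy.selector import Selector

# Throwaway HTML just to illustrate extract() vs extract_first().
body = '<ul><li><a href="/detail/1">a</a></li><li><a href="/detail/2">b</a></li></ul>'
sel = Selector(text=body)

links = sel.xpath('//a/@href').extract()        # ['/detail/1', '/detail/2'] -- a list
first = sel.xpath('//a/@href').extract_first()  # '/detail/1' -- a single string

print(links, first)

# scrapy.Request(links) would raise something like
#   TypeError: Request url must be str or unicode, got list
# which matches the 'spider_exceptions/TypeError': 1 in the stats above.
```

To check the fix quickly, you can run the spider file directly and dump the scraped items, e.g. `scrapy runspider en_name_spider.py -o items.json` (the file name here is just an example).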