
I am trying to scrape a site to get a very rough demographic picture of its users (no personally identifying information or photos), but the spider I adapted from the official documentation's tutorial repeats the same line of output four times in a row. Why is my Scrapy spider duplicating its output?

A copy of the code I am using is below:

Note that the example profile included in the code is a fake/spam account. In case it has been deleted, you can substitute any other profile URL from the site and the spider will run just the same.

import scrapy


class DateSpider(scrapy.Spider):
    name = "date"
    start_urls = [
        'http://www.pof.com/viewprofile.aspx?profile_id=141659067',
    ]

    def parse(self, response):
        for container in response.xpath('//div[@class="user-details-wide"]'):
            yield {
                'Gender': response.xpath("//span[@id='gender']/text()").extract_first(),
                'Age': response.xpath("//span[@id='age']/text()").extract_first(),
                'State': response.xpath("//span[@id='state_id']/text()").extract_first(),
                'Marital status': response.xpath("//span[@id='maritalstatus']/text()").extract_first(),
                'Body': response.xpath("//span[@id='body']/text()").extract_first(),
                'Height': response.xpath("//span[@id='height']/text()").extract_first(),
                'Ethnicity': response.xpath("//span[@id='ethnicity']/text()").extract_first(),
                'Does drugs?': response.xpath("//span[@id='drugs']/text()").extract_first(),
                'Smokes?': response.xpath("//span[@id='smoke']/text()").extract_first(),
                'Drinks?': response.xpath("//span[@id='drink']/text()").extract_first(),
                'Has children?': response.xpath("//span[@id='haschildren']/text()").extract_first(),
                'Wants children?': response.xpath("//span[@id='wantchildren']/text()").extract_first(),
                'Star sign': response.xpath("//span[@id='zodiac']/text()").extract_first(),
                'Education': response.xpath("//span[@id='college_id']/text()").extract_first(),
                'Personality': response.xpath("//span[@id='fishtype']/text()").extract_first(),
            }

Run with:

scrapy crawl date -o date.csv

What I am looking for is a header row followed by one row of results per item, straight down, rather than the blank lines and duplicates my output currently contains.

Answer


You don't need to use the for loop: just select each span element once and extract the data from it directly. (The duplication happens because every XPath inside your loop starts with //, so it searches the whole page rather than the container being iterated over; each of the matching div[@class="user-details-wide"] containers therefore yields an identical item.)

I would also recommend using Scrapy Items; they are more convenient. One way to clean the whitespace out of the extracted values is the XPath function normalize-space().

import scrapy
from items import DateSpiderItem


class DateSpider(scrapy.Spider):
    name = "date"
    start_urls = [
        'http://www.pof.com/viewprofile.aspx?profile_id=141659067',
    ]

    def parse(self, response):
        item = DateSpiderItem()
        item['Gender'] = response.xpath(
            "//span[@id='gender']/text()").extract_first()
        item['Age'] = response.xpath(
            "//span[@id='age']/text()").extract_first()
        item['State'] = response.xpath(
            "//span[@id='state_id']/text()").extract_first()
        item['Marital_status'] = response.xpath(
            "normalize-space(//span[@id='maritalstatus']/text())").extract_first()
        item['Body'] = response.xpath(
            "//span[@id='body']/text()").extract_first()
        item['Height'] = response.xpath(
            "//span[@id='height']/text()").extract_first()
        item['Ethnicity'] = response.xpath(
            "//span[@id='ethnicity']/text()").extract_first()
        item['Does_drugs'] = response.xpath(
            "normalize-space(//span[@id='drugs']/text())").extract_first()
        item['Smokes'] = response.xpath(
            "//span[@id='smoke']/text()").extract_first()
        item['Drinks'] = response.xpath(
            "normalize-space(//span[@id='drink']/text())").extract_first()
        item['Has_children'] = response.xpath(
            "normalize-space(//span[@id='haschildren']/text())").extract_first()
        item['Wants_children'] = response.xpath(
            "normalize-space(//span[@id='wantchildren']/text())").extract_first()
        item['Star_sign'] = response.xpath(
            "//span[@id='zodiac']/text()").extract_first()
        yield item

The items file:

import scrapy


class DateSpiderItem(scrapy.Item):
    Gender = scrapy.Field()
    Age = scrapy.Field()
    State = scrapy.Field()
    Marital_status = scrapy.Field()
    Body = scrapy.Field()
    Height = scrapy.Field()
    Ethnicity = scrapy.Field()
    Does_drugs = scrapy.Field()
    Smokes = scrapy.Field()
    Drinks = scrapy.Field()
    Has_children = scrapy.Field()
    Wants_children = scrapy.Field()
    Star_sign = scrapy.Field()
    Education = scrapy.Field()
    Personality = scrapy.Field()

Output:

[screenshot of the scraped item output]
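
As noted above, the duplication comes from the absolute // XPaths inside the loop. If a page really did contain several user-details-wide containers and you wanted one item per container, the loop could be kept by making each XPath relative to the container instead. A minimal sketch under that assumption (only a few of the fields are shown, with the span ids taken from the question):

import scrapy


class DateSpider(scrapy.Spider):
    name = "date"
    start_urls = [
        'http://www.pof.com/viewprofile.aspx?profile_id=141659067',
    ]

    def parse(self, response):
        # One item per container; the leading "." keeps each XPath
        # relative to the current container rather than the whole page.
        for container in response.xpath('//div[@class="user-details-wide"]'):
            yield {
                'Gender': container.xpath(".//span[@id='gender']/text()").extract_first(),
                'Age': container.xpath(".//span[@id='age']/text()").extract_first(),
                'State': container.xpath(".//span[@id='state_id']/text()").extract_first(),
            }

For a page that describes a single profile, though, the Item-based version above is the cleaner fix.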


I had to move and rename a few files, but I got your code working. It looks a lot tidier, too! Thank you very much for your help, I really appreciate it. –


No problem, I'm glad I could help. – vold