2013-05-01

Scrapy - how to avoid grouping the collected information into one item

I'm running into a problem with the data collected by Scrapy. When I run this code from the terminal, all of the collected information gets appended into a single item that looks like this:

{"fax": ["Fax: 617-638-4905", "Fax: 925-969-1795", "Fax: 913-327-1491", "Fax: 507-281-0291", "Fax: 509-547-1265", "Fax: 310-437-0585"], 
"title": ["Challenges in Musculoskeletal Rehabilitation", "17th Annual Spring Conference on Pediatric Emergencies", "19th Annual Association of Professors of Human & Medical Genetics (APHMG) Workshop & Special Interest Groups Meetings", "2013 AMSSM 22nd Annual Meeting", "61st Annual Meeting of Pacific Coast Reproductive Society (PCRS)", "Contraceptive Technology Conference 25th Anniversary", "Mid-America Orthopaedic Association 2013 Meeting", "Pain Management", "Peripheral Vascular Access Ultrasound", "SAGES 2013/ISLCRS 8th International Congress"], ... ... 

...and so on.

The problem is that all of the information for every field is being lumped together into one item. I need this information to come out as separate items. In other words, I need each title to be associated with one fax number (if it exists), one location, and so on.

I don't want all the information dumped together, because each piece of collected information belongs with specific other pieces. The way I ultimately want it to go into the database is:

"MedEconItem" 1: [title: "insert title 1 here", fax: "insert fax #1 here", location: "location 1" ...]

"MedEconItem" 2: [title: "title 2", fax: "fax #2", location: "location 2" ...]

"MedEconItem" 3: [...and so on

Any ideas on how to fix this? Does anyone know an easy way to separate this information? This is my first time using Scrapy, so any advice is welcome. I've looked everywhere and I can't seem to find an answer.

Here is my current code:

import scrapy 
from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.item import Item, Field 

class MedEconItem(Item): 
    title = Field() 
    date = Field() 
    location = Field() 
    specialty = Field() 
    contact = Field() 
    phone = Field() 
    fax = Field() 
    email = Field() 
    url = Field() 

class autoupdate(BaseSpider): 
    name = "medecon" 
    allowed_domains = ["www.doctorsreview.com"] 
    start_urls = [ 
        "http://www.doctorsreview.com/meetings/search/?region=united-states&destination=all&specialty=all&start=YYYY-MM-DD&end=YYYY-MM-DD", 
    ] 

    def serialize_field(self, field, name, value): 
        if field == '': 
            return super(MedEconItem, self).serialize_field(field, name, value) 

    def parse(self, response): 
        hxs = HtmlXPathSelector(response) 
        sites = hxs.select('//html/body/div[@id="c"]/div[@id="meeting_results"]') 
        items = [] 
        for site in sites: 
            item = MedEconItem() 
            item['title'] = site.select('//h3/a/text()').extract() 
            item['date'] = site.select('//p[@class = "dls"]/span[@class = "date"]/text()').extract() 
            item['location'] = site.select('//p[@class = "dls"]/span[@class = "location"]/a/text()').extract() 
            item['specialty'] = site.select('//p[@class = "dls"]/span[@class = "specialties"]/text()').extract() 
            item['contact'] = site.select('//p[@class = "contact"]/text()').extract() 
            item['phone'] = site.select('//p[@class = "phone"]/text()').extract() 
            item['fax'] = site.select('//p[@class = "fax"]/text()').extract() 
            item['email'] = site.select('//p[@class = "email"]/text()').extract() 
            item['url'] = site.select('//p[@class = "website"]/a/@href').extract() 
            items.append(item) 
        return item 
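The underlying issue in the code above is that every selector inside the loop starts with `//`, which is an absolute path: it searches the whole document, not just the current `site` node, so every item ends up collecting every match on the page. A minimal sketch of the difference, using lxml directly rather than Scrapy's selectors (same XPath semantics; the markup here is made up for illustration):

```python
from lxml import html

# Two result blocks, standing in for the real page markup (hypothetical).
doc = html.fromstring("""
<div id="meeting_results">
  <div class="result"><h3><a>First meeting</a></h3></div>
  <div class="result"><h3><a>Second meeting</a></h3></div>
</div>
""")

first_block = doc.xpath('//div[@class="result"]')[0]

# An absolute path ("//...") ignores the context node and matches document-wide:
everything = first_block.xpath('//h3/a/text()')

# A relative path (leading ".") stays inside the context node:
just_one = first_block.xpath('.//h3/a/text()')
```

Here `everything` contains both titles even though it was evaluated against only the first block, while `just_one` contains only `'First meeting'`. The same applies to `site.select(...)` in the spider: the per-item expressions need to be relative to `site`.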

Answer


OK, the following code seems to work, but sadly it involves some blatant hacks, because I'm terrible at XPath. Someone more familiar with XPath may be able to provide a better solution later.

def parse(self, response): 
    hxs = HtmlXPathSelector(response) 
    sites = hxs.select('//html/body/div[@id="c"]/div[@id="meeting_results"]//a[contains(@href,"meetings")]') 
    items = [] 
    for site in sites[1:-1]: 
        item = MedEconItem() 
        item['title'] = site.select('./text()').extract() 
        item['date'] = site.select('./following::p[@class = "dls"]/span[@class="date"]/text()').extract()[0] 
        item['location'] = site.select('./following::p[@class = "dls"]/span[@class = "location"]/a/text()').extract()[0] 
        item['specialty'] = site.select('./following::p[@class = "dls"]/span[@class = "specialties"]/text()').extract()[0] 
        item['contact'] = site.select('./following::p[@class = "contact"]/text()').extract()[0] 
        item['phone'] = site.select('./following::p[@class = "phone"]/text()').extract()[0] 
        item['fax'] = site.select('./following::p[@class = "fax"]/text()').extract()[0] 
        item['email'] = site.select('./following::p[@class = "email"]/text()').extract()[0] 
        item['url'] = site.select('./following::p[@class = "website"]/a/@href').extract()[0] 
        items.append(item) 
    return items 
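One fragile spot in this version: `extract()[0]` raises an IndexError whenever a field (for example the fax line) is missing for a meeting. A small hedge against that, sketched as a plain helper (the name `first_or_default` is my own, not part of Scrapy's API; later Scrapy versions added `extract_first()` for exactly this purpose):

```python
def first_or_default(values, default=None):
    """Return the first extracted value, or `default` when the list is empty."""
    return values[0] if values else default

# Usage inside the loop would look like:
#   item['fax'] = first_or_default(
#       site.select('./following::p[@class = "fax"]/text()').extract())
```

This way a meeting with no fax number yields `None` (or whatever default you choose) instead of crashing the spider.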

I tried this code, but it raised a NotImplementedError. The log says it crawled the site, but then it shows something like ERROR: Spider error processing <GET ...> – knn360 2013-05-01 20:01:04


That's strange. Which version of scrapy are you using? – Talvalin 2013-05-02 07:23:38


I'm using Scrapy 0.16.4 – knn360 2013-05-03 02:22:33
