2015-07-13 51 views

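I want to extract information from a relatively simple piece of HTML, but stray whitespace and <br> tags corrupt the JSON file I generate. The goal is to extract information from a div and make one field the parent of the other fields.

This is the main div with the content: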

main_div

It has the following code:

<div class="caixanorm"> 
    <div id="titulo"> 
     <a href="http://quonde.com.br/club-4/" rel="bookmark" title="Link para CLUB 4"> 
     <h2>CLUB 4</h2> 
     <h3 id="subtitulo">Academia        </h3> 
     </a> 
    </div> 
    <div id="endereco"> 
     (61) 3346-7423<br> 
     CRS 515, entrada W2     
    </div> 
    <div id="servecat"> 
     Em <a href="http://quonde.com.br/asasul/esporte/academias/" rel="category tag">Academias</a> da <a href="http://quonde.com.br/quadras/516-515/" rel="tag">516/515</a> Sul 
    </div> 
</div> 

This is my code:

- items.py

import scrapy 

class QuondeItem(scrapy.Item): 
    localizacao = scrapy.Field() #location 
    titulo = scrapy.Field()  #title 
    subtitulo = scrapy.Field() #subtitle 
    telefone = scrapy.Field()  #phone 
    endereco = scrapy.Field()  #address 
    categoria = scrapy.Field() #category 
    quadra = scrapy.Field()  #block 

- my_spider.py

import scrapy 
from quonde.items import QuondeItem 


class MySpider(scrapy.Spider): 
    name = "quonde" 
    allowed_domains = ["quonde.com.br"] 
    start_urls = [
        "http://quonde.com.br/quadras/516-515/",
    ]

    def parse(self, response):
        div = response.xpath('//div[@class="caixanorm"]')
        items = []
        for sel in div:
            item = QuondeItem()
            item['localizacao'] = sel.xpath('//h1[@class="inline"]/span/text()').extract()
            item['titulo'] = sel.xpath('//div[@id="titulo"]/a/h2/text()').extract()
            item['subtitulo'] = sel.xpath('//div[@id="titulo"]/a/h3/text()').extract()
            item['telefone'] = sel.xpath('//div[@id="endereco"]/text()[1]').extract()
            item['endereco'] = sel.xpath('//div[@id="endereco"]/text()[2]').extract()
            item['categoria'] = sel.xpath('//div[@id="servecat"]/a[1]/text()').extract()
            item['quadra'] = sel.xpath('//div[@id="servecat"]/a[@rel="tag"]/text()').extract()
            items.append(item)
            return items

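As we can see, the first field in items.py is not described inside the div, because I want it to be the parent item and the rest to be its children... but this is what I get: JSON Result. The phone and address come with HTML characters and whitespace, and I cannot make the location of each block the parent of all the other fields (explanation).

Besides that, I do not know whether the JSON itself is formed correctly. For example, title 0 corresponds to subtitle 0, but shouldn't each title and subtitle pair sit in its own cell, repeated for every entry, instead of everything being in a single cell?

Sorry for my English, and thank you!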

Answer


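The key problem here is that your XPath expressions are not relative to the current selection: you need a dot (.) at the beginning of each expression.

Also, you don't need to extract the location inside the loop; do it once before the loop.

Finally, to clean up the extracted fields, use an Item Loader with input and output processors: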
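The difference between a document-wide search and a search relative to the current node can be sketched with the standard library alone. This is only an illustration under assumptions: `xml.etree.ElementTree` stands in for Scrapy selectors (it supports a subset of XPath and rejects a leading `//` on an element, so the document-wide search is simulated by querying `root`), and the two-box markup is a trimmed-down version of the HTML in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, well-formed stand-in for the page in the question (assumption).
PAGE = """
<body>
  <div class="caixanorm"><div id="titulo"><a><h2>CLUB 4</h2></a></div></div>
  <div class="caixanorm"><div id="titulo"><a><h2>Micra</h2></a></div></div>
</body>
"""

root = ET.fromstring(PAGE)
boxes = root.findall('.//div[@class="caixanorm"]')

# Searching the whole document on every iteration: each pass sees every title.
document_wide = [[h2.text for h2 in root.findall('.//h2')] for _ in boxes]

# Searching relative to the current box: each pass sees only its own title.
relative = [[h2.text for h2 in box.findall('.//h2')] for box in boxes]

print(document_wide)
print(relative)
```

In Scrapy the same contrast appears between `sel.xpath('//h2/text()')`, which searches the whole document, and `sel.xpath('.//h2/text()')`, which searches only inside `sel`.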

import scrapy
# Older Scrapy releases exposed these under scrapy.contrib.loader
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose


class QuondeItem(scrapy.Item):
    localizacao = scrapy.Field()  # location
    titulo = scrapy.Field()       # title
    subtitulo = scrapy.Field()    # subtitle
    telefone = scrapy.Field()     # phone
    endereco = scrapy.Field()     # address
    categoria = scrapy.Field()    # category
    quadra = scrapy.Field()       # block


class QuondeItemLoader(ItemLoader):
    # Strip surrounding whitespace from every extracted value
    # (on Python 2, use unicode.strip instead of str.strip)
    default_input_processor = MapCompose(str.strip)
    # Return the first non-empty value instead of a list
    default_output_processor = TakeFirst()
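To see what the two processors do to a raw value such as the phone number, here is a plain-Python sketch that mimics their effect. It does not use Scrapy at all: `strip_each` and `take_first` are hypothetical helper names standing in for `MapCompose(str.strip)` and `TakeFirst()`, and the raw list is an assumed example of what `extract()` might return for the first text node of the `endereco` div.

```python
def strip_each(values):
    """Mimic MapCompose(str.strip): apply strip() to every extracted value."""
    return [value.strip() for value in values]


def take_first(values):
    """Mimic TakeFirst(): return the first value that is neither None nor empty."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Assumed raw extraction result: leading newline and indentation, as in the HTML.
raw_telefone = ["\n     (61) 3346-7423"]

cleaned = take_first(strip_each(raw_telefone))
print(repr(cleaned))
```

The input processor runs on each extracted value as it is added, and the output processor runs once when the item is loaded, which is why each field ends up as a single clean string rather than a list.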

The modified spider code:

import scrapy 
from quonde.items import QuondeItem, QuondeItemLoader 


class MySpider(scrapy.Spider): 
    name = "quonde" 
    allowed_domains = ["quonde.com.br"] 
    start_urls = [
        "http://quonde.com.br/quadras/516-515/",
    ]

    def parse(self, response):
        div = response.xpath('//div[@class="caixanorm"]')
        # The location is the same for every box, so extract it once
        location = response.xpath('.//h1[@class="inline"]/span/text()').extract()[0]
        for sel in div:
            loader = QuondeItemLoader(QuondeItem(), selector=sel)

            loader.add_value("localizacao", location)
            # The leading dot makes each XPath relative to the current box
            loader.add_xpath("titulo", './/div[@id="titulo"]/a/h2/text()')
            loader.add_xpath("subtitulo", './/div[@id="titulo"]/a/h3/text()')
            loader.add_xpath("telefone", './/div[@id="endereco"]/text()[1]')
            loader.add_xpath("endereco", './/div[@id="endereco"]/text()[2]')
            loader.add_xpath("categoria", './/div[@id="servecat"]/a[1]/text()')
            loader.add_xpath("quadra", './/div[@id="servecat"]/a[@rel="tag"]/text()')

            yield loader.load_item()

Here is the resulting JSON output:

[{"subtitulo": "Laborat\u00f3rio", "categoria": "Cl\u00ednicas e Consult\u00f3rios", "quadra": "516/515", "telefone": "(61) 3245-1275", "endereco": "CRS 515, Bl. B, Loja 77", "titulo": "Micra", "localizacao": "516/515"}, 
{"subtitulo": "Pneus e Rodas", "categoria": "Autom\u00f3veis", "quadra": "516/515", "telefone": "(61) 3346-1666", "endereco": "CRS 515, Bl. B, Loja 14", "titulo": "Impacto", "localizacao": "516/515"}, 
... 
{"subtitulo": "Cons\u00f3rcios", "categoria": "Consultorias e Assessorias", "quadra": "516/515", "telefone": "(61) 3346-8073", "endereco": "SHCS 516, Bl. C, Lj. 75", "titulo": "FERRAZ", "localizacao": "516/515"}, 
{"subtitulo": "Tape\u00e7aria", "categoria": "Decora\u00e7\u00f5es e Molduras", "quadra": "516/515", "telefone": "(61) 3245-3888", "endereco": "SHCS 516, Bl. C, Lj. 56", "titulo": "MUNDO DOS TAPETES", "localizacao": "516/515"}] 
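If the location should be the parent of the other fields, as the question asks, one option is to regroup the flat items after the crawl. The following is a minimal sketch under assumptions, not part of the spider above: `group_by_location` and the `itens` key are hypothetical names, and the sample records reuse a few fields from the output shown here.

```python
import json
from collections import defaultdict

# A few fields taken from the flat output above.
flat_items = [
    {"localizacao": "516/515", "titulo": "Micra", "telefone": "(61) 3245-1275"},
    {"localizacao": "516/515", "titulo": "Impacto", "telefone": "(61) 3346-1666"},
]


def group_by_location(items):
    """Nest every item under its location; 'itens' is a hypothetical key name."""
    grouped = defaultdict(list)
    for item in items:
        # Copy every field except the location, which becomes the parent.
        children = {key: value for key, value in item.items() if key != "localizacao"}
        grouped[item["localizacao"]].append(children)
    return [{"localizacao": loc, "itens": entries} for loc, entries in grouped.items()]


nested = group_by_location(flat_items)
print(json.dumps(nested, ensure_ascii=False, indent=2))
```

Keeping the spider output flat and nesting it in a separate step means each scraped item stays self-describing, while the grouped file carries the location only once per block.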

Why use a dot at the beginning of the expression?


@FilipeManuel Otherwise, you would extract every (for example) subtitle on each iteration of the loop. You need to make the expression context-specific. – alecxe


@FilipeManuel See also: http://doc.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths. – alecxe