
I'm new to Scrapy (and Python!) and I'm trying to scrape the ball-by-ball commentary from the Cricinfo website. Here is an example page: http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary

What I'm interested in scraping is the over number (e.g. 0.1) and the text next to it.

Using Firebug I can see that the xpath for "0.1" is: /html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[1]/p

and for the text next to it: /html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[2]/p

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from crictest.items import CrictestItem 

class MySpider(BaseSpider): 
    name = "cricinfo" 
    allowed_domains = ["espncricinfo.com/"] 
    start_urls = ["http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary/"] 

    def parse(self, response): 
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr')
        items = []
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('td[1]/p/text()').extract()
            item['overnumtext'] = row.select('td[2]/p/text()').extract()
            items.append(item)
        return items

I'm trying to loop through the rows (/tr) and then return td[1]/p/text() and td[2]/p/text(). My items.py looks like this:

import scrapy 


class CrictestItem(scrapy.Item): 
    overnum = scrapy.Field() 
    overnumtext = scrapy.Field() 

Running scrapy crawl cricinfo -o items.csv -t csv just gives me an items.csv file with no data in it at all.

Where am I going wrong? Any help would be much appreciated.
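A quick way to check whether an xpath matches anything in the HTML Scrapy actually downloads (rather than the DOM Firebug shows, where the browser inserts tbody elements that may not be in the page source) is the Scrapy shell. A minimal session, using the same HtmlXPathSelector API as the spider above, might look like this:

# run: scrapy shell "http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary"
from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)

# The Firebug-derived path returns nothing if the raw HTML has no <tbody>
print(len(hxs.select('//html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]'
                     '/table/tbody/tr/td[2]/div/table/tbody/tr')))

# A relative xpath without tbody is usually more forgiving
print(len(hxs.select('//table//tr[td[1]/p and td[2]/p]')))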

Answers


The xpaths you have are incorrect and, besides that, very brittle.

As far as I understand, you need the bold number and the text next to it. I would rely on the td elements with the battingComms class:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from crictest.items import CrictestItem 


class MySpider(BaseSpider): 
    name = "cricinfo" 
    allowed_domains = ["espncricinfo.com/"] 
    start_urls = ["http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary/"] 

    def parse(self, response): 
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//td[@class="battingComms" and b]')
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('b/text()').extract()[0]
            item['overnumtext'] = row.select('b/following-sibling::text()').extract()[0]
            yield item

Output on the console:

{'overnum': u'0.4', 
'overnumtext': u" bingo! that's a good ol slog from van Wyk right across the line of a good length ball that nips back in. No bat involved, but loads of timber. Lovely bowling from Paris and he knows it "} 
{'overnum': u'1.3', 
'overnumtext': u' and dies by his reputation. Behrendorff is assisted by some swing away, Delport flings his bat at with all his might and only ends up with an outside edge that is pouched behind the wicket. Brilliant catch from Whiteman as he leaps to his left and stretches as high as he could '} 
... 
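Note that BaseSpider and HtmlXPathSelector are legacy APIs from older Scrapy releases. On current Scrapy versions the same idea might be written roughly as follows (an untested sketch using scrapy.Spider and response.xpath, yielding plain dicts instead of the CrictestItem class):

import scrapy


class CricinfoSpider(scrapy.Spider):
    name = "cricinfo"
    allowed_domains = ["espncricinfo.com"]
    start_urls = [
        "http://www.espncricinfo.com/champions-league-twenty20-2014/"
        "engine/match/763595.html?innings=1;view=commentary"
    ]

    def parse(self, response):
        # Same selection idea: commentary cells that contain a bold over number
        for row in response.xpath('//td[@class="battingComms" and b]'):
            yield {
                'overnum': row.xpath('b/text()').extract_first(),
                'overnumtext': row.xpath('b/following-sibling::text()').extract_first(),
            }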

That seems more like it, but it isn't picking up every number. It only shows 11 records? Also, how should I have spotted the battingComms class? Thanks – Del 2014-09-30 17:28:41


@Del, how would I know what exactly you want to get from the page? – alecxe 2014-09-30 17:34:00


Sorry if I wasn't clear. I'd like a csv file with 2 columns. One column with the numbers: 0.1, 0.3 ... 19.5, 19.6. The other column with the text shown next to that number on the web page. – Del 2014-09-30 18:57:29
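The comment thread above comes down to wanting every delivery, not just the ones that end up in a bold battingComms cell. A hedged sketch of a parse method for that, assuming the commentary rows really do follow the structure from the question's Firebug xpath (minus the browser-inserted tbody), and noting the table xpath may still need narrowing to the commentary table itself:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Assumption: each ball is a <tr> whose first cell holds the over
        # number in a <p> and whose second cell holds the commentary text.
        # tbody is left out on purpose, since browsers add it to the DOM
        # even when the raw HTML does not contain it.
        rows = hxs.select('//table//tr[td[1]/p/text() and td[2]/p]')
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('td[1]/p/text()').extract()[0].strip()
            item['overnumtext'] = u' '.join(
                t.strip() for t in row.select('td[2]/p//text()').extract())
            yield item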


You can get the exact result you want from the example below.

Use the next-sibling approach (following-sibling in XPath) to get the right result.

The HTML is:

<div id="provider-region-addresses"> 
<h3>Contact details</h3> 
<h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>North Shore Hospital</dd><dt>Physical address</dt> 
       <dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt> 
       <dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt> 
       <dd>0740</dd><dt>District/town</dt> 

       <dd> 
       North Shore, Takapuna</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 486 8996</dd><dt>Fax</dt> 
       <dd>(09) 486 8342</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>Physical address</dt> 
       <dd>Helensville</dd><dt>Postal address</dt> 
       <dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt> 
       <dd>0840</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Helensville</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 420 9450</dd><dt>Fax</dt> 
       <dd>(09) 420 7050</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>Physical address</dt> 
       <dd>Warkworth</dd><dt>Postal address</dt> 
       <dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt> 
       <dd>0941</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Warkworth</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 422 2700</dd><dt>Fax</dt> 
       <dd>(09) 422 2709</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>Waitakere Hospital</dd><dt>Physical address</dt> 
       <dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt> 
       <dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt> 
       <dd>0650</dd><dt>District/town</dt> 

       <dd> 
       Waitakere, Henderson</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 839 0000</dd><dt>Fax</dt> 
       <dd>(09) 837 6634</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt> 
       <dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt> 
       <dd>0932</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Red Beach</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 427 0300</dd><dt>Fax</dt> 
       <dd>(09) 427 0391</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    </div> 

The spider code is:

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    practice = hxs.select('//h1/text()').extract()
    items1 = []

    results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
    for result in results:
        item = WebhealthItem1()
        #item['url'] = result.select('//dl/a/@href').extract()
        item['practice'] = practice
        item['hours'] = map(unicode.strip,
            result.select('dt[contains(.," Contact hours")]/following-sibling::dd[1]/text()').extract())
        item['more_hours'] = map(unicode.strip,
            result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
        item['physical_address'] = map(unicode.strip,
            result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
        item['postal_address'] = map(unicode.strip,
            result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
        item['postcode'] = map(unicode.strip,
            result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
        item['district_town'] = map(unicode.strip,
            result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
        item['region'] = map(unicode.strip,
            result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
        item['phone'] = map(unicode.strip,
            result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
        item['website'] = map(unicode.strip,
            result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
        item['email'] = map(unicode.strip,
            result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
        items1.append(item)
    return items1
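The spider references a WebhealthItem1 class that isn't shown in the answer; judging from the fields the parse method fills in, its items.py definition would presumably look something like this (field names taken from the code above):

import scrapy


class WebhealthItem1(scrapy.Item):
    # Fields inferred from the parse() method above
    practice = scrapy.Field()
    hours = scrapy.Field()
    more_hours = scrapy.Field()
    physical_address = scrapy.Field()
    postal_address = scrapy.Field()
    postcode = scrapy.Field()
    district_town = scrapy.Field()
    region = scrapy.Field()
    phone = scrapy.Field()
    website = scrapy.Field()
    email = scrapy.Field()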