2015-04-07

How do I extract (href, alt) pairs with Python Scrapy? I have an HTML page (the seed) of the following form:

<div class="sth1"> 
    <table cellspacing="6" width="600"> 
     <tr> 
      <td> 
       <a href="link1"><img alt="alt1" border="0" height="22" src="img1" width="92"></a> 
      </td> 
      <td> 
       <a href="link1">name1</a> 
      </td> 
      <td> 
       <a href="link2"><img alt="alt2" border="0" height="22" src="img2" width="92"></a> 
      </td> 
      <td> 
       <a href="link2">name2</a> 
      </td> 
     </tr> 
    </table> 
</div> 

What I want to do is loop over all the <tr> elements and extract every (href, alt) pair with Python Scrapy. In this example, I should get:

link1, alt1 
link2, alt2 

Answers

Here is an example from the Scrapy shell:

$ scrapy shell ./index.html 
In [1]: for cell in response.xpath("//div[@class='sth1']/table/tr/td"): 
   ...:     href = cell.xpath("a/@href").extract() 
   ...:     alt = cell.xpath("a/img/@alt").extract() 
   ...:     print(href, alt) 

['link1'] ['alt1'] 
['link1'] [] 
['link2'] ['alt2'] 
['link2'] [] 

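Here index.html contains the sample HTML provided in the question. Cells whose link wraps only text have no <img>, so their alt list comes back empty.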

Alternatively, you can combine Scrapy's built-in SelectorList with Python's zip():

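xpq = '//div[@class="sth1"]/table/tr/td[./a/img]' 
# response.xpath() already returns a SelectorList, so no extra wrapping is needed 
cells = response.xpath(xpq) 

# extract() turns the selectors into plain strings; list() is needed on Python 3 
list(zip(cells.xpath('a/@href').extract(), cells.xpath('a/img/@alt').extract())) 
=> [('link1', 'alt1'), ('link2', 'alt2')]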