2013-03-25

Scrapy: extract HTML from a div without the wrapping parent tag

I am using Scrapy to crawl a website.

I want to extract the contents of certain divs:

<div class="short-description"> 
{some mess with text, <br>, other html tags, etc} 
</div> 

loader.add_xpath('short_description', "//div[@class='short-description']/div") 

With this code I get what I need, but the result includes the wrapping HTML (`<div class="short-description">...</div>`).

How can I get rid of the parent HTML tag?

Note: selectors like text() and node() do not help me by themselves, because my div contains <br>, <p>, other divs, etc., plus whitespace, and I need to preserve all of it.

Answers

hxs = HtmlXPathSelector(response) 
# text() yields only the text nodes, so child tags such as <br> and <p> are dropped 
for text in hxs.select("//div[@class='short-description']/text()").extract(): 
    print text 

Try combining node() with Join():

loader.get_xpath('//div[@class="short-description"]/node()', Join()) 

and the result looks like this:

>>> from scrapy.contrib.loader import XPathItemLoader 
>>> from scrapy.contrib.loader.processor import Join 
>>> from scrapy.http import HtmlResponse 
>>> 
>>> body = """ 
...  <html> 
...   <div class="short-description"> 
...    {some mess with text, <br>, other html tags, etc} 
...    <div> 
...     <p>{some mess with text, <br>, other html tags, etc}</p> 
...    </div> 
...    <p>{some mess with text, <br>, other html tags, etc}</p> 
...   </div> 
...  </html> 
... """ 
>>> response = HtmlResponse(url='http://example.com/', body=body) 
>>> 
>>> loader = XPathItemLoader(response=response) 
>>> 
>>> print loader.get_xpath('//div[@class="short-description"]/node()', Join()) 

      {some mess with text, <br> , other html tags, etc} 
      <div> 
       <p>{some mess with text, <br>, other html tags, etc}</p> 
      </div> 
      <p>{some mess with text, <br>, other html tags, etc}</p> 
>>> 
>>> loader.get_xpath('//div[@class="short-description"]/node()', Join()) 
u'\n   {some mess with text, <br> , other html tags, etc}\n 
    <div>\n   <p>{some mess with text, <br>, other html tags, etc}</p>\n 
    </div> \n  <p>{some mess with text, <br>, other html tags, etc}</p> \n' 
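The same idea — serialize every child node of the target element and concatenate the results — can be sketched with only the standard library. This is a minimal illustration, not the Scrapy API: it uses xml.etree.ElementTree, so the input must be well-formed XML, whereas Scrapy's selectors are built on lxml and tolerate messy real-world HTML.

```python
# Stdlib sketch: get an element's inner HTML without its parent tag.
# Assumes well-formed XML input (ElementTree is not an HTML parser).
import xml.etree.ElementTree as ET

doc = """<html><div class="short-description">
  text before <p>a paragraph</p> tail text
</div></html>"""

root = ET.fromstring(doc)
div = root.find(".//div[@class='short-description']")

# Inner content = the element's leading text, plus each child
# serialized in turn (tostring() includes each child's tail text).
inner = (div.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in div
)
print(inner)
```

Note that child markup like `<p>` survives, while the `short-description` wrapper itself does not appear in the output — which is exactly what node() plus Join() achieves in the Scrapy session above.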