Scrapy XPath drops the text after a '<' character

I want to scrape product information from this page. To get the description (which appears at the bottom of the page), I use this XPath:

response.xpath('//*[@itemprop="description"]/table//text()').extract()[3].strip()

This gives me the following description:

u'Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section ('

whereas the description currently shown on the website is:

Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (<2cm), Belt Length: 93cm 
Product Type: Belts, Accessories

I have verified that this content loads on the site even with JavaScript disabled. What am I missing here?
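One way to double-check that the missing text really is in the raw markup (and is therefore lost during parsing, not during loading) is to search the undecoded response string directly. The sketch below is only an illustration: `context_around` is a hypothetical helper, and `sample` stands in for `response.text` from the page in question.

```python
def context_around(raw_html, needle, width=40):
    """Return the raw markup around the first occurrence of needle, or None."""
    pos = raw_html.find(needle)
    if pos == -1:
        return None  # the text is not in the downloaded HTML at all
    start = max(0, pos - width)
    return raw_html[start:pos + len(needle) + width]


# Stand-in for response.text; the real page markup is an assumption here.
sample = "<td>With the body width: section (<2cm), Belt Length: 93cm</td>"
print(context_around(sample, "Belt Length", width=10))
```

If the helper returns a snippet, the server did send the text, and the truncation happens in the HTML parser.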

Asked 2015-11-03 by Pravesh Jain

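It looks like the text is being cut off because of the '<' symbol. Even BeautifulSoup cuts the text off after the '<'... very strange – heinst

This is a 'parsel' bug; I will follow it up on the repository here: https://github.com/scrapy/parsel/issues/23 – eLRuLL

Did this help? – eLRuLL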

This should really be handled without any hack, but you can get it working with:

from parsel import Selector
...

# response.text is the decoded body; the older spelling,
# response.body_as_unicode(), is deprecated in newer Scrapy releases.
s = Selector(text=response.text, type='xml')
s.xpath('//*[@itemprop="description"]/table//text()').extract()[3].strip()
# gives u'Color: White, Size:Free Size, With the body: Braided, Buckle: Automatic Deduction, With the body width: section (2cm), Belt Length: 93cm'

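The problem here is that parsel (the parser Scrapy uses internally) builds its HTML selectors with lxml.etree.HTMLParser(recover=True, encoding='utf8'). In recovery mode the parser discards stray characters such as this '<' to avoid failing on malformed markup, and the text that follows is lost with it.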
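An alternative to switching the selector type, sketched here as an assumption and not as something parsel provides, is to escape the stray '<' before the HTML parser sees it. `escape_stray_lt` and `STRAY_LT` are hypothetical names, and the regular expression is a heuristic, not a full HTML tokenizer.

```python
import re

# A '<' that is not followed by a letter, '/', '!' or '?' cannot start a tag,
# a closing tag, a comment/doctype or a processing instruction, so it is
# treated here as literal text.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html):
    """Replace '<' characters that cannot open markup with the entity '&lt;'."""
    return STRAY_LT.sub("&lt;", html)


sample = "<td>With the body width: section (<2cm), Belt Length: 93cm</td>"
print(escape_stray_lt(sample))
# -> <td>With the body width: section (&lt;2cm), Belt Length: 93cm</td>
```

In a spider, the cleaned string could then be passed to Selector(text=escape_stray_lt(response.text)) with the default HTML parser. This has not been tested against the page in the question, and the substitution also rewrites comparisons such as `a < 5` inside inline scripts, which is usually harmless when only text nodes are being extracted.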

Answered 2015-11-03 15:53:11 by eLRuLL
