Concatenating nested text with XPath in Scrapy - 2.0

I want to extract all of the text inside the itemprop="ingredients" elements on this website.

I saw this answer, which is close to what I want, but it specifies the elements, and my text is not nested the same way.

Here is the HTML:

<li itemprop="ingredients">Beginning of ingredient 
    <a href="some-link" data-ct-category="Other" 
    data-ct-action="Site Search" 
    data-ct-information="Recipe Search - Hellmann's® or Best Foods® Real Mayonnaise" 
    data-ct-attr="some_attr">Rest of Ingredient</a> 
</li> 
<li itemprop="ingredients">Another ingredient</li> 
<li itemprop="ingredients">Another ingredient</li> 
<li itemprop="ingredients">Another ingredient</li> 
<li itemprop="ingredients">Another ingredient</li> 
<li itemprop="ingredients">Another ingredient</li>

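What I need is to get the text back as a list. The first element of the list would be "Beginning of ingredient", then a space, joined with "Rest of Ingredient"; the other elements would each be "Another ingredient".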

I got close with:

for row in response.xpath('//*[@itemprop="ingredients"]/descendant-or-self::*/text()'): 
...  print row.extract() 
... 
Beginning of ingredient 
Rest of Ingredient 

    Another ingredient 
    Another ingredient 
    Another ingredient 
    Another ingredient 
    Another ingredient

So when I put it in a list, using extract_first() on each row, I get this:

['Beginning of ingredient', "Rest of Ingredient", 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']

But I want this:

['Beginning of ingredient Rest of Ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']

Source

2016-09-26 Elvira Gandelman

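You are close. Iterate over each li element, then evaluate descendant-or-self relative to that element, so the text nodes of each item stay grouped together: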

In [1]: [" ".join(map(unicode.strip, item.xpath("descendant-or-self::text()").extract())) 
     for item in response.xpath('//li[@itemprop="ingredients"]')] 
Out[1]: 
[u'Beginning of ingredient Rest of Ingredient ', 
u'Another ingredient', 
u'Another ingredient', 
u'Another ingredient', 
u'Another ingredient', 
u'Another ingredient']
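Note that the first item in the output above still ends with a trailing space, because the whitespace-only text node after the link strips down to an empty string that is then joined in. A minimal Python 3 sketch of one way to avoid that, assuming the text nodes of each li have already been extracted into a list of strings (the helper name join_text_nodes is hypothetical, and the sample lists are hand-written to mirror the HTML in the question rather than scraped):

```python
def join_text_nodes(nodes):
    """Join the raw text nodes of one element into a single string.

    Joining first and then splitting on whitespace collapses newlines,
    indentation and empty nodes, so no stray spaces are left behind.
    """
    return " ".join(" ".join(nodes).split())


# Hand-written stand-ins for what
#   li.xpath("descendant-or-self::text()").extract()
# would return for the first and second <li> in the question's HTML.
first_li = ["Beginning of ingredient \n    ", "Rest of Ingredient", " \n"]
other_li = ["Another ingredient"]

ingredients = [join_text_nodes(nodes) for nodes in (first_li, other_li)]
print(ingredients)
# ['Beginning of ingredient Rest of Ingredient', 'Another ingredient']
```

Inside a spider, the same helper could be applied to item.xpath("descendant-or-self::text()").extract() for each li. An alternative that does the collapsing in XPath itself is item.xpath("normalize-space(.)").extract_first(), since normalize-space() trims the string value of the node and reduces internal whitespace runs to single spaces.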

Source

2016-09-26 14:02:29 alecxe

I run into characters with ordinal > 127 (the well-known error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 16: ordinal not in range(128))
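The comment does not include the traceback, so the exact cause is unknown. A plausible guess is that a string containing the ® character (u'\xae') is being encoded with the ascii codec somewhere, for example by an implicit conversion in Python 2. The sketch below only reproduces that class of error and shows that encoding explicitly as UTF-8 works; the sample string is an assumption modelled on the data-ct-information attribute in the question:

```python
# Sample text with a non-ASCII character (U+00AE, the registered sign).
text = u"Hellmann's\xae or Best Foods\xae Real Mayonnaise"

# Encoding with the ascii codec fails, because U+00AE is outside range(128).
try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)

print(error)

# Encoding explicitly with UTF-8 succeeds and round-trips without loss.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
print(decoded == text)
```

If this is indeed the cause, keeping the values as unicode strings while joining and stripping, and encoding to UTF-8 only when writing output, should avoid the error.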
