Concatenating nested XPath text in Scrapy

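I have been trying to concatenate some nested text with XPath in Scrapy. I believe Scrapy uses XPath 1.0? I have looked at a bunch of other posts, but none of them gets me quite what I want.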

Here is the relevant part of the HTML (actual page: http://adventuretime.wikia.com/wiki/List_of_episodes):

<tr> 
<td colspan="5" style="border-bottom: #BCD9E3 3px solid"> 
    Finn and Princess Bubblegum must protect the <a href="/wiki/Candy_Kingdom" title="Candy Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created. 
</td> 
</tr> 

<tr> 
<td colspan="5" style="border-bottom: #BCD9E3 3px solid"> 
Finn must travel to <a href="/wiki/Lumpy_Space" title="Lumpy Space">Lumpy Space</a> to find a cure that will save Jake, who was accidentally bitten by <a href="/wiki/Lumpy_Space_Princess" title="Lumpy Space Princess">Lumpy Space Princess</a> at Princess Bubblegum's annual 'Mallow Tea Ceremony.' 
</td> 
</tr> 

(much more stuff here)

Here is the result I want to get back:

[u'Finn and Princess Bubblegum must protect the Candy Kingdom from a horde of candy zombies they accidentally 
    created.\n', u'Finn must travel to Lumpy Space to find a cure that will save Jake, who was accidentally bitten', (more stuff here)]

I tried the answer from "HTML XPath: Extracting text mixed in with multiple tags?":

description =sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']/parent::tr/td[descendant-or-self::text()]").extract()

But that just gives me back:

[u'<td colspan="5" style="border-bottom: #BCD9E3 3px solid">Finn and Princess Bubblegum must protect the <a href="/wiki/ 
Candy_Kingdom" title="Candy Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created.\n</td>',

The string() answer didn't seem to work for me either... I get back a list with only one entry, and there should be many.

The closest I have gotten is:

description = sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract()

which gives me back:

[u'Finn and Princess Bubblegum must protect the ', u'Candy Kingdom', u' from a horde of candy zombies they accidentally 
created.\n', u'Finn must travel to ', u'Lumpy Space', u' to find a cure that will save Jake, who was accidentally bitten, (more stuff here)]

Does anyone have XPath tips on concatenation?

Thanks!

Edit: spider code, for the manual join() approach:

from scrapy import Selector, Spider


class AT_Episode_Detail_Spider_2(Spider):

    name = "ep_detail_2"
    allowed_domains = ["adventuretime.wikia.com"]
    start_urls = [
        "http://adventuretime.wikia.com/wiki/List_of_episodes"
    ]

    def parse(self, response):
        sel = Selector(response)

        # //text() returns every text node separately, so link text arrives as its own fragment
        description = sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract()
        print(description)

Asked 2015-07-20 by pyramidface

Join the text nodes:

description = " ".join(sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract())
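One caveat with a space separator: the fragments returned by //text() already carry their own leading and trailing spaces, so " ".join(...) doubles the spaces around link text. Below is a minimal sketch of one way around that, assuming the fragments of a single cell are already in a list. The name join_fragments is a hypothetical helper for illustration, not part of Scrapy.

```python
def join_fragments(fragments):
    """Concatenate the text fragments of one cell and collapse whitespace runs."""
    # Join with an empty separator: each fragment keeps its own spacing.
    text = "".join(fragments)
    # str.split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so re-joining with one space normalizes it.
    return " ".join(text.split())


# Fragments of the first cell, as shown in the question's output.
cell = [
    u"Finn and Princess Bubblegum must protect the ",
    u"Candy Kingdom",
    u" from a horde of candy zombies they accidentally created.\n",
]
joined = join_fragments(cell)
print(joined)
```

Note that this only cleans up the fragments of one cell; applied to the fragments of the whole page it would still merge every description into a single string.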

Or use a Join() processor in combination with an Item Loader.

Here is some simple code to get a list of episode descriptions:

def parse(self, response):
    # Build one joined string per description cell, skipping footnote text inside <sup>
    description = [" ".join(row.xpath(".//text()[not(ancestor::sup)]").extract())
                   for row in response.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan]")]
    print(description)
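The key point of the loop above is that the text is collected once per cell, so each description stays a separate list entry. This is also why a single string() call over all cells returns one entry: in XPath 1.0, string() applied to a node-set converts only the first node. The sketch below illustrates the per-cell idea on a shortened copy of the question's markup. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without Scrapy; that swap is an assumption made for illustration, and it only works on well-formed markup. The name cell_descriptions is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the markup from the question (styles omitted).
HTML = """
<table class="wikitable">
<tr><th>Title</th></tr>
<tr><td colspan="5">
    Finn and Princess Bubblegum must protect the <a href="/wiki/Candy_Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created.
</td></tr>
<tr><td colspan="5">
Finn must travel to <a href="/wiki/Lumpy_Space">Lumpy Space</a> to find a cure that will save Jake.
</td></tr>
</table>
"""


def cell_descriptions(markup):
    """Return one whitespace-normalized string per <td colspan="5"> cell."""
    root = ET.fromstring(markup.strip())
    descriptions = []
    # One iteration per description cell keeps the results separate.
    for cell in root.iter("td"):
        if cell.get("colspan") != "5":
            continue
        # itertext() yields every descendant text node in document order,
        # including the text inside nested <a> elements.
        text = "".join(cell.itertext())
        descriptions.append(" ".join(text.split()))
    return descriptions


descriptions = cell_descriptions(HTML)
print(descriptions)
```

Inside a spider, calling row.xpath("string(.)") within the same kind of loop should give a comparable per-cell concatenation, although unlike the answer's expression it would not skip text inside sup elements.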

Answered 2015-07-20 01:27:58 by alecxe

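join() isn't quite what I'm looking for. I should have been more specific. Note that the data I want returned contains more than one string. I only want to combine each piece of text with the text of the tags nested inside it, not combine all of the text and tags together. I'll update my HTML real quick... – pyramidface

@pyramidface You can still solve it with join(). Beyond that, you would probably need to iterate over the rows to build a list of descriptions. Could you also post the complete spider code so I can better understand the context? Thanks! – alecxe

@pyramidface OK, I've updated the answer to include code that gets a list of descriptions. Is this what you were asking for? Thanks. – alecxe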
