How to select text without the HTML markup

I'm working on a web scraper (in Python), so I have a chunk of HTML that I'm trying to extract text from. One of the snippets looks like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

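I'd like to extract the text from that element. Right now I can use something along the lines of

//p[@class='something']//text()

but this returns each chunk of text as a separate result element, like this: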

(This class has some ,text, and a few ,links, in it.)

The desired output would contain all the text in one element, like this:

This class has some text and a few links in it.

Is there a simple or elegant way to achieve this?

Edit: Here is the code that produces the result given above.

from lxml import html 

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>' 

xpath_query = "//p[@class='something']//text()" 

tree = html.fromstring(html_snippet) 
query_results = tree.xpath(xpath_query) 
for item in query_results: 
    print("'{0}'".format(item))


2015-04-01 Yuka

What HTML parsing library are you using? – alecxe 2015-04-01 19:03:34

I'm using lxml; I've updated the question. – Yuka 2015-04-01 19:10:38

Instead of fetching the text with XPath, you can call .text_content() on the lxml element.

from lxml import html 

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>' 

xpath_query = "//p[@class='something']" 

tree = html.fromstring(html_snippet) 
query_results = tree.xpath(xpath_query) 
for item in query_results: 
    print("'{0}'".format(item.text_content()))


2015-04-01 19:49:07

You can use normalize-space() in XPath. Then

from lxml import html 

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>' 
xpath_query = "normalize-space(//p[@class='something'])" 

tree = html.fromstring(html_snippet) 
print(tree.xpath(xpath_query))

will yield

This class has some text and a few links in it.
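One thing worth knowing about this approach: when normalize-space() is given a node set, XPath 1.0 converts only the first node (in document order) to a string, so a query matching several paragraphs still returns a single string. A minimal sketch of evaluating it per element instead, assuming a made-up snippet with two paragraphs and some stray whitespace (the markup and variable names here are illustrative, not from the question):

```python
from lxml import html

# Hypothetical snippet: two matching paragraphs, the first with messy whitespace.
html_snippet = (
    '<div>'
    '<p class="something">  First   <strong>paragraph</strong>\n here. </p>'
    '<p class="something">Second <a href="http://www.example.com">one</a>.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Given a node set, normalize-space() only uses the first matching node.
first_only = tree.xpath("normalize-space(.//p[@class='something'])")

# Evaluating normalize-space(.) on each element gives one clean string per match.
per_paragraph = [
    p.xpath("normalize-space(.)")
    for p in tree.xpath(".//p[@class='something']")
]

print(first_only)
print(per_paragraph)
```

Leading and trailing whitespace is stripped and internal runs of whitespace are collapsed to a single space, which .text_content() on its own does not do.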


2015-04-01 19:49:01 kjhughes

An alternative one-liner for your original code: use join with an empty-string separator:

print("".join(query_results))
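For completeness, here is a self-contained sketch of that one-liner combined with the code from the question. The variable name joined is introduced here only for illustration.

```python
from lxml import html

html_snippet = (
    '<p class="something">This class has some <strong>text</strong> '
    'and a few <a href="http://www.example.com">links</a> in it.</p>'
)

tree = html.fromstring(html_snippet)

# //text() returns a list of string fragments, one per text node.
query_results = tree.xpath("//p[@class='something']//text()")

# An empty separator is enough because the fragments already carry
# the spaces that surrounded the inline tags.
joined = "".join(query_results)
print(joined)
```

Because the whitespace is taken as-is from the fragments, this keeps any extra spaces or newlines present in the source markup.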


2015-04-01 19:50:39 bjimba
