2015-04-01 88 views
2

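How to select text without the HTML markup

I am working on a web scraper (using Python), so I have a chunk of HTML that I am trying to extract text from. One of the snippets looks like this: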

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p> 

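I want to extract the text from that element. Right now I can use something along the lines of

//p[@class='something']//text()

but this results in each chunk of text ending up as a separate result element, like this: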

(This class has some ,text, and a few ,links, in it.) 

The desired output would contain all of the text in a single element, like this:

This class has some text and a few links in it. 

Is there a simple or elegant way to achieve this?

Edit: Here is the code that produces the result given above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

# //text() returns every text node under the paragraph as a separate result
xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))

Which HTML parsing library are you using? – alecxe 2015-04-01 19:03:34


I am using lxml; I have updated the question. – Yuka 2015-04-01 19:10:38

Answers

1

You can call .text_content() on the lxml element instead of fetching the text with XPath.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

# Select the element itself rather than its text nodes
xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    # text_content() returns the text of the element and all of its descendants
    print("'{0}'".format(item.text_content()))
3

You can use normalize-space() in XPath. Then

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
# normalize-space() takes the string value of the paragraph and collapses its whitespace
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

will produce

This class has some text and a few links in it. 
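
One caveat worth sketching: when an XPath 1.0 string function such as normalize-space() is handed a node-set, it uses only the first node in document order, so the query above returns a single string even if several paragraphs match. The example below is a minimal sketch of one way to handle that case, assuming a hypothetical input with two matching paragraphs (the markup is made up for illustration): select the elements first, then evaluate normalize-space(.) relative to each one.

```python
from lxml import html

# Hypothetical input for illustration: two matching paragraphs,
# the first one containing a line break and extra spaces.
html_snippet = (
    '<div>'
    '<p class="something">This class has some\n   <strong>text</strong> in it.</p>'
    '<p class="something">A second <a href="http://www.example.com">link</a> here.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Passing the whole node-set to normalize-space() only uses the first match.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Selecting the elements first and evaluating normalize-space(.) on each
# one yields one cleaned-up string per paragraph.
texts = [p.xpath("normalize-space(.)") for p in tree.xpath("//p[@class='something']")]

print(first_only)
for text in texts:
    print(text)
```

Whether this is preferable to calling .text_content() on each element depends on whether the whitespace should be collapsed; .text_content() leaves it as it appears in the source.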
0

An alternative one-liner for your original code: use join with an empty string as the separator:

print("".join(query_results)) 
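
If the fragments may contain line breaks or repeated spaces, the join can be combined with a whitespace cleanup step. The following is a small sketch under that assumption; join_fragments is a hypothetical helper name, and the list of fragments is typed in by hand to mirror the result shown in the question rather than produced by lxml.

```python
def join_fragments(fragments):
    """Join text fragments and collapse runs of whitespace into single spaces."""
    joined = "".join(fragments)
    # str.split() with no arguments splits on any whitespace and drops empty pieces,
    # so re-joining with a single space also trims leading and trailing whitespace.
    return " ".join(joined.split())


# Fragments in the shape returned by the //text() query in the question
query_results = ["This class has some ", "text", " and a few ", "links", " in it."]
print(join_fragments(query_results))
```

The fragments are joined with an empty string first because the text nodes already carry their own surrounding spaces; joining with a space instead would introduce doubled spaces before the cleanup.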