2017-08-07

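Unable to parse links from XML content

I've written a script in Python using XPath to scrape links from a site that serves XML content. Since I've never worked with XML before, I can't figure out what I'm doing wrong. Thanks in advance for any workaround. Here is my attempt: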

import requests 
from lxml import html 

response = requests.get("https://drinkup.london/sitemap.xml").text 
tree = html.fromstring(response) 
for item in tree.xpath('//div[@class="expanded"]//span[@class="text"]'): 
    print(item) 

The elements within the XML content that hold the links:

<div xmlns="http://www.w3.org/1999/xhtml" class="collapsible" id="collapsible4"><div class="expanded"><div class="line"><span class="button collapse-button"></span><span class="html-tag">&lt;url&gt;</span></div><div class="collapsible-content"><div class="line"><span class="html-tag">&lt;loc&gt;</span><span class="text">https://drinkup.london/</span><span class="html-tag">&lt;/loc&gt;</span></div></div><div class="line"><span class="html-tag">&lt;/url&gt;</span></div></div><div class="collapsed hidden"><div class="line"><span class="button expand-button"></span><span class="html-tag">&lt;url&gt;</span><span class="text">...</span><span class="html-tag">&lt;/url&gt;</span></div></div></div> 

The error thrown on execution is given below:

value = etree.fromstring(html, parser, **kw) 
    File "src\lxml\lxml.etree.pyx", line 3228, in lxml.etree.fromstring (src\lxml\lxml.etree.c:79593) 
    File "src\lxml\parser.pxi", line 1843, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:119053) 
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration. 

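You are assigning the 'text' attribute of the 'requests.get' result to the 'response' variable. That attribute is a unicode string, hence the error. Use the 'content' attribute instead of 'text'. – peterfields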
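To illustrate the comment above, here is a minimal offline sketch of the difference between passing str and bytes to lxml. SAMPLE is a hypothetical stand-in for the sitemap body (the real response is not reproduced here), and try_parse is a helper name made up for this example.

```python
from lxml import html

# Hypothetical stand-in for the sitemap body; not the site's real response.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://drinkup.london/</loc></url>'
    '</urlset>'
)


def try_parse(document):
    """Return the loc texts, or the error message if lxml rejects the input."""
    try:
        tree = html.fromstring(document)
    except ValueError as exc:
        return str(exc)
    return tree.xpath('//url/loc/text()')


# str input, comparable to what response.text gives
as_text = try_parse(SAMPLE)
# bytes input, comparable to what response.content gives
as_bytes = try_parse(SAMPLE.encode("utf-8"))

print(as_text)
print(as_bytes)
```

With the str input, lxml should refuse the document because of the encoding declaration in the first line; with the bytes input it can read the declaration itself and parse normally.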

Answer


Switch to .content, which returns bytes, instead of .text, which returns unicode:

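import requests
from lxml import html

# .content returns raw bytes, which lxml accepts even with an XML encoding declaration
response = requests.get("https://drinkup.london/sitemap.xml").content
tree = html.fromstring(response)
# lxml's HTML parser does not apply XML namespaces, so plain tag names match
for item in tree.xpath('//url/loc/text()'):
    print(item)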

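Note the fixed XPath expression as well. The div and span elements shown in the question appear to come from the browser's XML viewer rather than from the sitemap itself, so the expression has to target the url and loc elements of the actual XML.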
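If the document is parsed as XML rather than with lxml's HTML parser, the element names become namespace-qualified and the path has to say so. The following is a rough sketch using the standard library's xml.etree.ElementTree instead of lxml (a swapped-in parser, not what the answer uses). It assumes the sitemap declares the standard sitemaps.org namespace; SAMPLE, the second URL in it, and extract_links are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second URL is invented for the example.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<url><loc>https://drinkup.london/</loc></url>'
    b'<url><loc>https://drinkup.london/example-page</loc></url>'
    b'</urlset>'
)

# Prefix-to-URI mapping; the "sm" prefix is an arbitrary local choice.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_links(xml_bytes):
    """Return the text of every url/loc element in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]


links = extract_links(SAMPLE)
for link in links:
    print(link)

# Without the namespace prefix, a real XML parser finds nothing.
unqualified = ET.fromstring(SAMPLE).findall("url/loc")
```

The trade-off is that an XML parser is strict about namespaces and well-formedness, while the HTML parser used in the answer is more forgiving and lets the plain //url/loc path work.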


You are awesome, sir alecxe. Whenever I'm in trouble, you are there. It worked like magic. Thanks a lot. – SIM