2010-11-15 83 views

I have an HTML file with some span elements. How do I solve the problem of parsing an HTML file that contains Cyrillic symbols?

<html> 
<body> 
<span class="one">Text</span>some text</br> 
<span class="two">Привет</span>Текст на русском</br> 
</body> 
</html> 

To get "some text":

# -*- coding:cp1251 -*- 
import lxml 
from lxml import html 

filename = "t.html" 
fread = open(filename, 'r') 
source = fread.read() 

tree = html.fromstring(source) 
fread.close() 


tags = tree.xpath('//span[@class="one" and text()="Text"]') #This OK 
print "name: ",tags[0].text 
print "value: ",tags[0].tail 

tags = tree.xpath('//span[@class="two" and text()="Привет"]') #This False 

print "name: ",tags[0].text 
print "value: ",tags[0].tail 

This prints:

name: Text 
value: some text 

Traceback: ... in line `tags = tree.xpath('//span[@class="two" and text()="Привет"]')` 
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes 

How can I fix this?

Answers

4

LXML

(As observed, this is somewhat hit-and-miss across system encodings; it apparently doesn't work properly under Windows XP, although it did in Linux.)

I got it working by decoding the source string: tree = html.fromstring(source.decode('utf-8'))

# -*- coding:cp1251 -*- 
import lxml 
from lxml import html 

filename = "t.html" 
fread = open(filename, 'r') 
source = fread.read() 

tree = html.fromstring(source.decode('utf-8')) 
fread.close() 


tags = tree.xpath('//span[@class="one" and text()="Text"]') #This OK 
print "name: ",tags[0].text 
print "value: ",tags[0].tail 

tags = tree.xpath('//span[@class="two" and text()="Привет"]') #This is now OK too 

print "name: ",tags[0].text 
print "value: ",tags[0].tail 

This implies that the resulting tree consists entirely of unicode objects. If you only make the xpath parameter unicode (without decoding the source), it finds 0 matches.
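As a side note, in Python 3 this whole distinction disappears, because str is Unicode by default. A minimal sketch of the same match, using the stdlib xml.etree.ElementTree instead of lxml for illustration (the invalid </br> tags are dropped so the snippet is well-formed XML):

```python
import xml.etree.ElementTree as ET

# Python 3 str literals are Unicode, so the Cyrillic text needs no
# special decoding step before matching.
source = ('<html><body>'
          '<span class="one">Text</span>some text'
          '<span class="two">Привет</span>Текст на русском'
          '</body></html>')

tree = ET.fromstring(source)
matches = [s for s in tree.iter('span') if s.text == 'Привет']
print('name:', matches[0].text)   # prints: name: Привет
print('value:', matches[0].tail)  # prints: value: Текст на русском
```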

BeautifulSoup

I prefer using BeautifulSoup for anything of this sort anyway. Here is my interactive session; I saved the file in cp1251.

>>> from BeautifulSoup import BeautifulSoup 
>>> filename = '/tmp/cyrillic' 
>>> fread = open(filename, 'r') 
>>> source = fread.read() 
>>> source # Scary 
'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\xcf\xf0\xe8\xe2\xe5\xf2</span>\xd2\xe5\xea\xf1\xf2 \xed\xe0 \xf0\xf3\xf1\xf1\xea\xee\xec</br>\n</body>\n</html>\n' 
>>> source = source.decode('cp1251') # Let's try getting this right. 
>>> source 
u'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\u041f\u0440\u0438\u0432\u0435\u0442</span>\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</br>\n</body>\n</html>\n' 
>>> soup = BeautifulSoup(source) 
>>> soup # OK, that's looking right now. Note the </br> was dropped as that's bad HTML with no meaning. 
<html> 
<body> 
<span class="one">Text</span>some text 
<span class="two">Привет</span>Текст на русском 
</body> 
</html> 

>>> soup.find('span', 'one').findNextSibling(text=True) 
u'some text' 
>>> soup.find('span', 'two').findNextSibling(text=True) # This looks a bit daunting ... 
u'\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c' 
>>> print _ # ... but it's not, really. Just Unicode chars. 
Текст на русском 
>>> # Then you may also wish to get things by text: 
>>> print soup.find(text=u'Привет').findParent().findNextSibling(text=True) 
Текст на русском 
>>> # You can't get things by attributes and the contained NavigableString at the same time, though. That may be a limitation. 

At the end of all that, it's probably worth considering trying source.decode('cp1251') instead of source.decode('utf-8') when you're reading the file from the file system. lxml may then actually work.
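The difference between the two decodes can be seen without lxml at all. A minimal Python 3 sketch; the byte values are the same ones shown in the cp1251 dump in the session above:

```python
# The Cyrillic text as cp1251 bytes, i.e. what the saved file contains.
raw = 'Привет'.encode('cp1251')
print(raw)  # prints: b'\xcf\xf0\xe8\xe2\xe5\xf2'

# Decoding with the file's real encoding round-trips cleanly...
assert raw.decode('cp1251') == 'Привет'

# ...while decoding the same bytes as utf-8 raises UnicodeDecodeError,
# because 0xcf, 0xf0 etc. are not a valid utf-8 sequence.
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('utf-8 decode failed, as expected')
```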


It doesn't work either; I tried running it under Windows XP. – HammerSpb 2010-11-15 09:49:52


I did it on Linux. Hold on, I'll start my XP virtual machine and see whether I can get it working on XP. – 2010-11-15 09:56:12


Thanks Chris! Under XP it is an ANSI file. – HammerSpb 2010-11-15 10:03:05

0

Untested, but wrapping tags[0].tail in the unicode built-in function should do it: unicode(tags[0].tail)


The problem is in this line: tags = tree.xpath('//span[@class="two" and text()="Привет"]') – HammerSpb 2010-11-15 02:44:07


OK, then try 'text()=u"Привет"'; if that doesn't work, try 'text()=unicode("Привет")' – jonesy 2010-11-15 02:47:46


I tried those options. Same result. My html file is in ASCII format and my python script is in ASCII format. I tried converting them to UTF-8, but that did nothing. (Maybe I misunderstand something about codecs?) – HammerSpb 2010-11-15 02:54:26

4

Try this:

tree = html.fromstring(source.decode('utf-8')) 

tags = tree.xpath('//span[@class="two" and text()="%s"]' % u'Привет')
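A note on that interpolation, sketched for Python 3, where every str literal is already Unicode, so the u'' prefix from the answer is only needed in Python 2. (If lxml is available, its xpath() also accepts XPath variables, e.g. tree.xpath('//span[text()=$t]', t=u'Привет'), which sidesteps quoting and encoding issues in the expression.)

```python
# Building the same query string in Python 3: the result is a single
# fully Unicode XPath expression, with no byte/unicode mixing possible.
cyr = 'Привет'
query = '//span[@class="two" and text()="%s"]' % cyr
print(query)  # prints: //span[@class="two" and text()="Привет"]
```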