2010-11-15 83 views

I have an HTML file with some span elements. How do I solve the problem of parsing an HTML file that contains Cyrillic symbols?

<html> 
<body> 
<span class="one">Text</span>some text</br> 
<span class="two">Привет</span>Текст на русском</br> 
</body> 
</html> 

To get "some text":

# -*- coding:cp1251 -*- 
import lxml 
from lxml import html 

filename = "t.html" 
fread = open(filename, 'r') 
source = fread.read() 

tree = html.fromstring(source) 
fread.close() 


tags = tree.xpath('//span[@class="one" and text()="Text"]') #This OK 
print "name: ",tags[0].text 
print "value: ",tags[0].tail 

tags = tree.xpath('//span[@class="two" and text()="Привет"]') #This False 

print "name: ",tags[0].text 
print "value: ",tags[0].tail 

This prints:

name: Text 
value: some text 

Traceback: ... in line `tags = tree.xpath('//span[@class="two" and text()="Привет"]')` 
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes 

How can I fix this?

Answers

4

LXML

(As observed, this is somewhat hit-and-miss across system encodings; it apparently doesn't work properly under Windows XP, although it did in Linux.)

I got it working by decoding the source string: tree = html.fromstring(source.decode('utf-8'))

# -*- coding:cp1251 -*- 
import lxml 
from lxml import html 

filename = "t.html" 
fread = open(filename, 'r') 
source = fread.read() 

tree = html.fromstring(source.decode('utf-8')) 
fread.close() 


tags = tree.xpath('//span[@class="one" and text()="Text"]') #This OK 
print "name: ",tags[0].text 
print "value: ",tags[0].tail 

tags = tree.xpath('//span[@class="two" and text()="Привет"]') #This is now OK too 

print "name: ",tags[0].text 
print "value: ",tags[0].tail 

This implies that the resulting tree consists entirely of unicode objects. If you only make the xpath parameter unicode (without decoding the source), it finds 0 matches.
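As a side note, in Python 3 this whole distinction disappears, because str is Unicode by default. A minimal sketch of the same match, using the stdlib xml.etree.ElementTree instead of lxml for illustration (the invalid </br> tags are dropped so the snippet is well-formed XML):

```python
import xml.etree.ElementTree as ET

# Python 3 str literals are Unicode, so the Cyrillic text needs no
# special decoding step before matching.
source = ('<html><body>'
          '<span class="one">Text</span>some text'
          '<span class="two">Привет</span>Текст на русском'
          '</body></html>')

tree = ET.fromstring(source)
matches = [s for s in tree.iter('span') if s.text == 'Привет']
print('name:', matches[0].text)   # prints: name: Привет
print('value:', matches[0].tail)  # prints: value: Текст на русском
```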

BeautifulSoup

I prefer using BeautifulSoup for anything of this sort anyway. Here is my interactive session; I saved the file in cp1251.

>>> from BeautifulSoup import BeautifulSoup 
>>> filename = '/tmp/cyrillic' 
>>> fread = open(filename, 'r') 
>>> source = fread.read() 
>>> source # Scary 
'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\xcf\xf0\xe8\xe2\xe5\xf2</span>\xd2\xe5\xea\xf1\xf2 \xed\xe0 \xf0\xf3\xf1\xf1\xea\xee\xec</br>\n</body>\n</html>\n' 
>>> source = source.decode('cp1251') # Let's try getting this right. 
>>> source 
u'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\u041f\u0440\u0438\u0432\u0435\u0442</span>\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</br>\n</body>\n</html>\n' 
>>> soup = BeautifulSoup(source) 
>>> soup # OK, that's looking right now. Note the </br> was dropped as that's bad HTML with no meaning. 
<html> 
<body> 
<span class="one">Text</span>some text 
<span class="two">Привет</span>Текст на русском 
</body> 
</html> 

>>> soup.find('span', 'one').findNextSibling(text=True) 
u'some text' 
>>> soup.find('span', 'two').findNextSibling(text=True) # This looks a bit daunting ... 
u'\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c' 
>>> print _ # ... but it's not, really. Just Unicode chars. 
Текст на русском 
>>> # Then you may also wish to get things by text: 
>>> print soup.find(text=u'Привет').findParent().findNextSibling(text=True) 
Текст на русском 
>>> # You can't get things by attributes and the contained NavigableString at the same time, though. That may be a limitation. 

At the end of all that, it's probably worth considering trying source.decode('cp1251') instead of source.decode('utf-8') when you're reading the file from the file system. lxml may then actually work.
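The difference between the two decodes can be seen without lxml at all. A minimal Python 3 sketch; the byte values are the same ones shown in the cp1251 dump in the session above:

```python
# The Cyrillic text as cp1251 bytes, i.e. what the saved file contains.
raw = 'Привет'.encode('cp1251')
print(raw)  # prints: b'\xcf\xf0\xe8\xe2\xe5\xf2'

# Decoding with the file's real encoding round-trips cleanly...
assert raw.decode('cp1251') == 'Привет'

# ...while decoding the same bytes as utf-8 raises UnicodeDecodeError,
# because 0xcf, 0xf0 etc. are not a valid utf-8 sequence.
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('utf-8 decode failed, as expected')
```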


It doesn't work either; I tried running it under Windows XP. – HammerSpb 2010-11-15 09:49:52


I did it on Linux. Hold on, I'll start my XP virtual machine and see whether I can get it working on XP. – 2010-11-15 09:56:12


Thanks Chris! Under XP it is an ANSI file. – HammerSpb 2010-11-15 10:03:05

0

Untested, but wrapping tags[0].tail in the unicode built-in function should do it: unicode(tags[0].tail)


The problem is in this line: tags = tree.xpath('//span[@class="two" and text()="Привет"]') – HammerSpb 2010-11-15 02:44:07


OK, then try 'text()=u"Привет"'; if that doesn't work, try 'text()=unicode("Привет")' – jonesy 2010-11-15 02:47:46


I tried those options. Same result. My html file is in ASCII format and my python script is in ASCII format. I tried converting them to UTF-8, but that did nothing. (Maybe I misunderstand something about codecs?) – HammerSpb 2010-11-15 02:54:26

4

Try this:

tree = html.fromstring(source.decode('utf-8')) 

tags = tree.xpath('//span[@class="two" and text()="%s"]' % u'Привет')
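A note on that interpolation, sketched for Python 3, where every str literal is already Unicode, so the u'' prefix from the answer is only needed in Python 2. (If lxml is available, its xpath() also accepts XPath variables, e.g. tree.xpath('//span[text()=$t]', t=u'Привет'), which sidesteps quoting and encoding issues in the expression.)

```python
# Building the same query string in Python 3: the result is a single
# fully Unicode XPath expression, with no byte/unicode mixing possible.
cyr = 'Привет'
query = '//span[@class="two" and text()="%s"]' % cyr
print(query)  # prints: //span[@class="two" and text()="Привет"]
```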