How to get Cyrillic strings from a document

I have the following code:

import urllib 
from BeautifulSoup import BeautifulSoup 

page = urllib.urlopen("http://habrahabr.ru/") 
soup = BeautifulSoup(page.read()) 
for topic in soup.findAll(True, 'topic'): 
    print topic 
    print 
raw_input()

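The site loads, but Python displays the wrong characters for the Cyrillic words.

Any help with this problem would be much appreciated.

P.S.

I changed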

soup = BeautifulSoup(page.read())

to

soup = BeautifulSoup(page.read(), fromEncoding="utf-8")

but still no result...


2011-02-24 Mirgorod

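The data on the HTML page is encoded in UTF-8. It looks like you are printing it to your console, where sys.stdout.encoding is cp1251. That explains the garbage you are seeing.

Here is the result of examining the first 8 bytes of the first topic, using IDLE: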

>>> raw = '\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe' 
>>> print raw.decode('utf8') 
Алго 
>>> print raw.decode('cp1251') 
РђР»РіРѕ 
>>>
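
For anyone on Python 3, the same check might look roughly like this. It is only a sketch: the bytes are the ones from the session above, and the variable names are mine.

```python
# The first 8 bytes of the topic, as a bytes object (Python 3).
raw = b"\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe"

# Decoding with the encoding the page actually uses gives readable text.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as cp1251 gives the garbage seen on the console.
as_cp1251 = raw.decode("cp1251")

print(as_utf8)
print(as_cp1251)

# The garbage is reversible: re-encode as cp1251, then decode as UTF-8.
repaired = as_cp1251.encode("cp1251").decode("utf-8")
print(repaired)
```

The last step only works because every byte here happens to be defined in cp1251, so I would not rely on it as a general repair.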


2011-02-24 22:40:52

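I get the point, but what do I need to do in my example? I tried converting with 'page.read().decode('utf8')' but got no result... – Mirgorod 2011-02-24 23:04:49

Hmm, this is strange, but only one of these is displayed normally... the other items show the wrong characters... – Mirgorod 2011-02-24 23:07:34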

Thanks for the help.

I solved the problem with this code (in Django):

print str(topic).decode('utf8')
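
Why decoding first helps, as far as I can tell: once the bytes are decoded to text, print re-encodes that text for whatever encoding the output stream uses. A rough Python 3 sketch, where the in-memory cp1251 stream is only a stand-in I made up for the console:

```python
import io

# UTF-8 bytes for the start of the topic, as they arrive from the page.
raw = b"\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe"

# Assumption: an in-memory text stream with encoding cp1251 plays the
# role of the console, so we can inspect exactly which bytes get written.
console_bytes = io.BytesIO()
console = io.TextIOWrapper(console_bytes, encoding="cp1251", newline="\n")

# Decode first, then print: the stream encodes the text as cp1251.
text = raw.decode("utf-8")
print(text, file=console)
console.flush()

written = console_bytes.getvalue()
print(written)
```

The bytes that reach the stream are the cp1251 form of the same letters, which is what a cp1251 console needs in order to display them correctly.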


2011-02-24 23:44:54 Mirgorod

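I very much doubt that 'str()' is needed. 'print topic.decode('utf8')' should be enough. – 2011-02-25 01:05:20

I think it may be needed in some cases, because Python 3 renamed the unicode type to str, and the old str type was replaced by bytes. – str14821 2014-04-28 15:40:56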

I solved it like this:

from django.utils.encoding import force_unicode 
print ("%s" % force_unicode(topic, encoding='utf-8', strings_only=False, errors='strict'))

This way you can do it from within Django.
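
If Django is not available, something similar can be sketched in plain Python 3. Note that to_text below is a hypothetical helper of my own, not Django's force_unicode; it only mirrors the encoding and errors parameters used above.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Hypothetical helper: return value as text (str in Python 3)."""
    if isinstance(value, str):
        # Already text, nothing to decode.
        return value
    if isinstance(value, (bytes, bytearray)):
        # Raw bytes: decode with the given encoding.
        return bytes(value).decode(encoding, errors)
    # Any other object: fall back to its string form.
    return str(value)


title = to_text(b"\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe")
print(title)
```

With errors='strict', bytes that are not valid in the given encoding raise UnicodeDecodeError instead of being silently replaced, which seems like the safer default when tracking down an encoding problem.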


2011-02-25 08:38:32
