Characters are not displayed correctly with BeautifulSoup

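I want to scrape the names of some settlements from a website using the BeautifulSoup library. The website uses the 'windows-1250' character set, but some characters are not displayed correctly. See the last settlement name in the output, which should be Župkov.

Can you help me solve this problem? Here is the code: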

# imports  
import requests 
from bs4 import BeautifulSoup 
from bs4 import NavigableString 

# create beautifulsoup object 
obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500' 
source_code = requests.get(obce_url) 
plain_text = source_code.text 
obce_soup = BeautifulSoup(plain_text, 'html.parser') 

# define bs filter 
def soup_filter_1(tag): 
    return tag.has_attr('href') and len(tag.attrs) == 1 and isinstance(tag.next_element, NavigableString) 

# print settlement names 
for tag in obce_soup.find_all(soup_filter_1): 
    print(tag.string)

I am using Python 3.5.1 and BeautifulSoup 4.4.1.

Source

2016-11-27 user21816

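The problem is not with BeautifulSoup. It simply cannot determine which encoding your data is in (try print('encoding', obce_soup.original_encoding)), and that is because you are handing it already-decoded Unicode text (a str) instead of bytes.

If you try this: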

obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500' 
source_code = requests.get(obce_url) 
data_bytes = source_code.content # don't use .text it will try to make Unicode 
obce_soup = BeautifulSoup(data_bytes, 'html.parser') 
print('encoding', obce_soup.original_encoding)

when creating your BeautifulSoup object, you will see that it now detects the encoding correctly and your output is OK.
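To see the difference between str and bytes without hitting the network, here is a minimal sketch. The HTML snippet, its meta tag and the link target are made-up sample data standing in for the real page, and it assumes the page declares its charset in a meta tag as shown.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: it declares windows-1250 in a meta tag.
html = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=windows-1250"></head>'
        '<body><a href="/obec/zupkov">Župkov</a></body></html>')
raw = html.encode('cp1250')  # plays the role of source_code.content

# Bytes in: BeautifulSoup can read the declared charset and decode by itself.
soup_from_bytes = BeautifulSoup(raw, 'html.parser')
print('encoding', soup_from_bytes.original_encoding)
print(soup_from_bytes.a.string)

# Already-decoded text in: there is nothing left to detect.
# latin-1 is used here only as an example of a wrong decode.
soup_from_str = BeautifulSoup(raw.decode('latin-1'), 'html.parser')
print('encoding', soup_from_str.original_encoding)
```

With a str as input, original_encoding is None, because BeautifulSoup never saw the raw bytes and can only trust the text it was given.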

Source

2016-11-27 08:57:15 Anthon

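Thank you for your answer. It works as expected. – user21816

The server probably sends an HTTP header that indicates UTF-8, while the HTML itself uses Windows-1250, so requests decodes the data as UTF-8.

But you can take the original bytes from source_code.content and use decode('cp1250') to get the correct characters.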

plain_text = source_code.content.decode('cp1250')
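A small illustration of why the codec matters, using only the standard library. The byte string is sample data standing in for source_code.content, and the two wrong codecs are assumptions chosen for the demonstration, not necessarily what requests picked for this site.

```python
# Stand-in for source_code.content: the name as cp1250 bytes (assumed sample data).
raw = 'Župkov'.encode('cp1250')

# Two wrong guesses, used here only for illustration.
as_latin1 = raw.decode('latin-1')
as_utf8 = raw.decode('utf-8', errors='replace')

# The charset the page actually uses.
as_cp1250 = raw.decode('cp1250')

print(repr(as_latin1), repr(as_utf8), as_cp1250)
```

Neither wrong decode raises an exception here. The first byte just becomes a control character or a replacement character, which is why the problem only shows up as odd output.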

Or you can set encoding manually before you read text:

source_code.encoding = 'cp1250' 

plain_text = source_code.text

By the way, you can also pass the original bytes, source_code.content, directly to BeautifulSoup, so that it can use the encoding information from the HTML:

obce_soup = BeautifulSoup(source_code.content, 'html.parser')

To check which encoding the HTML declared, see:

print(obce_soup.declared_html_encoding)

Source

2016-11-27 08:43:55 furas

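Thanks for your answer. It works well. – user21816

Since you know the website's encoding, you can pass it explicitly to the BeautifulSoup constructor, together with the response's content rather than its text: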

source_code = requests.get(obce_url) 
content = source_code.content 
obce_soup = BeautifulSoup(content, 'html.parser', from_encoding='windows-1250')
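from_encoding is most useful when the page does not declare its charset at all. A minimal sketch, assuming a made-up snippet with no meta tag (the name Čadca and the link are sample data, not taken from the site):

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment without any charset declaration.
raw = '<html><body><a href="/obec/cadca">Čadca</a></body></html>'.encode('cp1250')

# Tell BeautifulSoup the encoding instead of letting it guess.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='windows-1250')
name = soup.a.string
print(name)
```

Without from_encoding the result for such a page depends on BeautifulSoup's guessing, so stating the encoding explicitly is the safer option when you already know it.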

Source

2016-11-27 08:49:49 valignatev
