Encoding emojis with Beautiful Soup

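Looking for some help. I'm working on a project that uses Beautiful Soup in Python to scrape specific Craigslist posts. I can successfully display emojis found in the post title, but not those in the post body. I have tried different variations, but nothing has worked so far. Any help would be appreciated.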

Code:

f = open("clcondensed.txt", "w") 
html2 = requests.get("https://raleigh.craigslist.org/wan/6078682335.html") 
soup = BeautifulSoup(html2.content,"html.parser") 
#Post Title 
title = soup.find(id="titletextonly")  
title1 = soup.title.string.encode("ascii","xmlcharrefreplace") 
f.write(title1) 
#Post Body 
body = soup.find(id="postingbody")   
body = str(body) 
body = body.encode("ascii","xmlcharrefreplace") 
f.write(body)

Error received from the body:

'ascii' codec can't decode byte 0xef in position 273: ordinal not in range(128)

Asked by Phil21, 2017-04-07

Possibly similar to this: http://stackoverflow.com/questions/9644099/python-ascii-codec-cant-decode-byte – anonyXmous

You should use unicode:

body = unicode(body)

See NavigableString in the Beautiful Soup documentation.
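For context, the error in the question can be reproduced in a few lines. In Python 2, str(body) yields UTF-8 bytes, and calling .encode("ascii", ...) on a byte string first decodes it as ASCII behind the scenes, which fails on any byte above 127. The sketch below is written in Python 3 syntax, where that hidden decode has to be spelled out; the sample text is an assumption chosen only because its UTF-8 form starts with 0xef, like the byte in the error message.

```python
# Python 3 sketch. In Python 2, str.encode() on a byte string performs the
# ASCII decode shown in the try block implicitly.
raw = "\ufffd ship".encode("utf-8")   # stand-in for str(body): UTF-8 bytes
print(raw[:3])                        # b'\xef\xbf\xbd'

try:
    raw.decode("ascii")               # what Python 2 attempts before encoding
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# Decoding with the real encoding first avoids the error.
text = raw.decode("utf-8")
safe = text.encode("ascii", "xmlcharrefreplace")
print(safe)                           # b'&#65533; ship'
```

Working with text (unicode) objects until the final write keeps this implicit decode from happening at all.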

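Update:

Sorry for my quick answer. It was not right.

Here you should use the lxml parser rather than html.parser, because html.parser does not handle NCR (Numeric Character Reference) emojis well.

In my test, when the decimal value of an NCR emoji is greater than 65535, as with the 🚢 emoji (&#128674;) in your HTML example, html.parser wrongly decodes it as the Unicode replacement character \ufffd instead of u"\U0001F6A2". I could not find an exact Beautiful Soup reference for this, but the lxml parser handles it correctly.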
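To make the 65535 threshold concrete, here is a small sketch using only the Python 3 standard library (an assumption on my side, since the code in this thread is Python 2). It decodes the same reference with html.unescape and shows that the resulting character lies outside the Basic Multilingual Plane, which is the range where the mis-decoding was observed.

```python
# Python 3 sketch, standard library only.
import html

ncr = "&#128674;"                  # decimal numeric character reference
ship = html.unescape(ncr)          # decode the reference to a real character

print(ship == "\U0001F6A2")        # True: a single code point, U+1F6A2
print(ord(ship) > 65535)           # True: outside the Basic Multilingual Plane
utf16_units = len(ship.encode("utf-16-le")) // 2
print(utf16_units)                 # 2: needs a surrogate pair in UTF-16
print(ship.encode("utf-8"))        # b'\xf0\x9f\x9a\xa2'

# Round trip back to an ASCII-safe reference, as in the question's code.
print(ship.encode("ascii", "xmlcharrefreplace"))  # b'&#128674;'
```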

Here is the test code:

# Python 2 code: unicode() is the Python 2 text type.
import requests
from bs4 import BeautifulSoup
f = open("clcondensed.txt", "w")
html = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
# Parse with lxml so that NCRs above 65535 are decoded correctly.
soup = BeautifulSoup(html.content, "lxml")
# Post Title
title = soup.find(id="titletextonly")
title = unicode(title)
# Encode to UTF-8 bytes explicitly before writing.
f.write(title.encode('utf-8'))
# Post Body
body = soup.find(id="postingbody")
body = unicode(body)
f.write(body.encode('utf-8'))
f.close()
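If you are on Python 3, where unicode() no longer exists, the writing step might look roughly like the sketch below. It opens the file with an explicit UTF-8 encoding so that text containing emojis can be written directly. The save_text helper, the temporary directory, and the sample strings are hypothetical and only stand in for the scraped title and body.

```python
# Python 3 sketch: let the file object handle the UTF-8 encoding.
import io
import os
import tempfile


def save_text(path, *parts):
    """Hypothetical helper: write text fragments to path as UTF-8, one per line."""
    with io.open(path, "w", encoding="utf-8") as f:
        for part in parts:
            f.write(part + "\n")


# Hypothetical output location; the thread's code writes to clcondensed.txt
# in the current directory instead.
path = os.path.join(tempfile.mkdtemp(), "clcondensed.txt")
save_text(path, "Title \U0001F6A2", "Body text")

with io.open(path, encoding="utf-8") as f:
    content = f.read()
print(content)
```

With a Tag object from Beautiful Soup, str(tag) would take the place of unicode(tag) before the text is passed to such a helper.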

You can refer to lxml entity handling for more details.

If you do not have lxml installed, just refer to lxml installing.

Hope this helps.

Answered by Fogmoon, 2017-04-08 12:06:50

Thanks for your help and the linked references. Works as expected. Much appreciated! – Phil21

@Phil21 Glad to help – Fogmoon
