ScraperWiki character encoding anomaly

Here is a ScraperWiki scraper written in Python:

import lxml.html
import scraperwiki
from unidecode import unidecode

html = scraperwiki.scrape("http://www.timeshighereducation.co.uk/world-university-rankings/2012-13/world-ranking/range/001-200")
root = lxml.html.fromstring(html)
for tr in root.cssselect("table.ranking tr"):
    # Only rows that carry both a rank cell and a university cell are data rows
    if len(tr.cssselect("td.rank")) > 0 and len(tr.cssselect("td.uni")) > 0:
        # Transliterate the university name to ASCII, then normalise the case
        university = unidecode(tr.cssselect("td.uni")[0].text_content()).strip().title()
        if 'cole' in university:
            print(university)

It produces the following output:

Ecole Polytechnique Federale De Lausanne 
Ecole Normale Superieure 
Acole Polytechnique 
Ecole Normale Superieure De Lyon

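My question: what causes the initial character of the third output line to be rendered as "A" rather than "E", and how can I stop this from happening?

Asked 2013-05-07 by sampablokuper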

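There is a difference between the names that show up as Ecole and the one that comes out as Acole. The Ecole ones are actually '&Eacute;cole' in the page source, whereas the one that stands out is 'École Polytechnique', i.e. a literal character rather than an HTML entity. The breakage may be happening in either 'lxml' or 'unidecode'. Also make sure your terminal supports the correct encoding. – soulseekah 2013-05-07 19:37:55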
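
One plausible reading of the comment above is that the entity form reaches the parser as plain ASCII and is expanded safely, while the literal "É" arrives as raw UTF-8 bytes that get decoded with the wrong codec. The sketch below illustrates that mechanism with the standard library only and is written for Python 3. `ascii_fold` is a hypothetical, rough stand-in for `unidecode`, and Latin-1 as the wrong codec is an assumption, not something confirmed for this page.

```python
import html
import unicodedata


def ascii_fold(text):
    """Hypothetical stand-in for unidecode: strip accents, drop other non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Case 1: the name is written with an HTML entity, so the source is pure ASCII.
entity_form = html.unescape("&Eacute;cole Normale Superieure")

# Case 2: the name is a literal character, sent as UTF-8 bytes.
raw_bytes = "École Polytechnique".encode("utf-8")
misdecoded = raw_bytes.decode("latin-1")  # assumed wrong codec
correct = raw_bytes.decode("utf-8")       # the codec the bytes were written in

print(ascii_fold(entity_form))
print(ascii_fold(misdecoded))
print(ascii_fold(correct))
```

Under these assumptions the two bytes of "É" become "Ã" plus a control character, and ASCII folding reduces that pair to a lone "A", which matches the stray "Acole" in the question.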

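You're right. Oddly, the Firefox inspector doesn't show this difference. I'm now trying to work out a solution. By the way, if you'd like to turn your comment into an answer, I'd be happy to upvote it (and if it also answers the second part of my question, I'd of course be happy to accept it). – sampablokuper 2013-05-07 19:45:04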

Based on soulseekah's helpful comment above and on the lxml documentation, the following solution works:

import lxml.html
import scraperwiki
from unidecode import unidecode
from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    # Let UnicodeDammit detect the encoding and return a unicode string
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        # UnicodeDecodeError needs five arguments, so raise ValueError instead
        raise ValueError(
            "Failed to detect encoding, tried [%s]"
            % ', '.join(converted.triedEncodings))
    return converted.unicode

html = scraperwiki.scrape("http://www.timeshighereducation.co.uk/world-university-rankings/2012-13/world-ranking/range/001-200")
# Decode explicitly so that lxml receives unicode rather than raw bytes
root = lxml.html.fromstring(decode_html(html))
for tr in root.cssselect("table.ranking tr"):
    if len(tr.cssselect("td.rank")) > 0 and len(tr.cssselect("td.uni")) > 0:
        university = unidecode(tr.cssselect("td.uni")[0].text_content()).strip().title()
        if 'cole' in university:
            print(university)

Answered 2013-05-07 19:53:19 by sampablokuper
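
The core idea in the code above is to decode the downloaded bytes explicitly before the parser sees them. A minimal standard-library sketch of that idea for Python 3 follows. It simply tries candidate encodings in order, which is a much cruder technique than UnicodeDammit's detection. The function name `decode_bytes` and the candidate list are assumptions made for illustration.

```python
def decode_bytes(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the order of candidates matters, because a
    permissive codec placed first would accept almost any input.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("Failed to decode, tried [%s]" % ", ".join(candidates))


# The same markup, encoded two different ways
utf8_page = "<td class='uni'>École Polytechnique</td>".encode("utf-8")
legacy_page = "<td class='uni'>École Polytechnique</td>".encode("cp1252")

utf8_text, utf8_used = decode_bytes(utf8_page)
legacy_text, legacy_used = decode_bytes(legacy_page)
print(utf8_used, legacy_used)
```

Strict UTF-8 goes first because it rejects most byte sequences produced by other encodings, so a successful decode is a fairly strong signal. The resulting string could then be passed to `lxml.html.fromstring` in place of the raw bytes.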
