2014-11-08

I'm writing a scraper for http://www.delfi.lt (using lxml and py3k on Windows 8). The goal is to write certain information out to a .txt file. Since the site is in Lithuanian, ASCII obviously won't do as the encoding, so I'm trying to write everything out as UTF-8. However, not all non-ASCII characters make it into the file correctly.

For example, I get DELFI Å½inios > Dienos naujienos > UÅ¾sienyje instead of DELFI Žinios > Dienos naujienos > Užsienyje.
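That kind of garbling is classic mojibake: UTF-8 bytes being decoded as Latin-1 somewhere along the way. A minimal standard-library sketch of the mechanism (and its reversal):

```python
# "Ž" is two bytes in UTF-8 (0xC5 0xBD); read back as Latin-1 they
# become the two characters "Å" and "½" — exactly the garbling above.
correct = "DELFI Žinios"
mojibake = correct.encode("utf-8").decode("latin-1")
print(mojibake)    # DELFI Å½inios

# The damage is reversible as long as no bytes were lost:
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)    # DELFI Žinios
```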

This is as far as I've gotten with the scraper:

from lxml import html
import sys

# Takes in command line input, namely the URL of the story and (optionally)
# the name of the CSV file that will store all of the data.
# Outputs a list of two strings: the first is the URL, the second is the name
# if given, otherwise an empty string.
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})(
            'Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    page = html.parse(url)

    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write(title.text + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()
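For what it's worth, the find/findall calls above can be exercised offline against an inline document (a sketch; the headline text and category names here are made up):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <h1 itemprop="headline">Pavyzdys</h1>
  <span itemprop="title">DELFI Žinios</span>
  <span itemprop="title">Užsienyje</span>
</body></html>""")

# fromstring returns an element, so the paths are relative (".//" not "//")
title = doc.find('.//h1[@itemprop="headline"]')
category = doc.findall('.//span[@itemprop="title"]')
print(title.text)                              # Pavyzdys
print(' > '.join(x.text for x in category))    # DELFI Žinios > Užsienyje
```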

An example run: python scraper.py http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799 produces a file named output.txt containing:

Ukraina: separatistai siautÄja, O. TurÄynovas atnaujina mobilizacijÄ 
DELFI Å½inios > Dienos naujienos > UÅ¾sienyje 

instead of

Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizaciją 
DELFI Žinios > Dienos naujienos > Užsienyje 

How do I get the script to output all of the text correctly?

Answer


Using requests and BeautifulSoup, and letting requests fetch the raw bytes via .content, works for me (r.text would decode using the encoding guessed from the HTTP headers, which can be wrong; r.content hands BeautifulSoup the raw bytes so it can detect the charset itself):

import requests
from bs4 import BeautifulSoup

def main():
    url, name = "http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799", "foo.csv"
    r = requests.get(url)

    # r.content is the raw bytes; BeautifulSoup works out the encoding itself
    page = BeautifulSoup(r.content, "html.parser")

    title = page.find("h1", {"itemprop": "headline"})
    category = page.find_all("span", {"itemprop": "title"})
    print(title)
    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write(title.text + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

Output:

Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizacijąnaujausi susirėmimų vaizdo įrašai 
DELFI Žinios > Dienos naujienos > Užsienyje 

Changing the parser encoding also works:

parser = etree.HTMLParser(encoding="utf-8")
page = html.parse(url, parser)
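The effect of the explicit encoding can be checked offline by feeding lxml raw UTF-8 bytes directly (a sketch with a made-up snippet; no network involved):

```python
from lxml import etree, html

# Raw UTF-8 bytes, as they would arrive over the wire
raw = '<html><body><h1 itemprop="headline">Užsienyje</h1></body></html>'.encode("utf-8")

# With the encoding given explicitly, lxml decodes the bytes as UTF-8
# instead of guessing
parser = etree.HTMLParser(encoding="utf-8")
doc = html.fromstring(raw, parser=parser)
print(doc.findtext('.//h1'))    # Užsienyje
```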

So change your code to the following:

from lxml import html, etree
import sys

# Takes in command line input, namely the URL of the story and (optionally)
# the name of the CSV file that will store all of the data.
# Outputs a list of two strings: the first is the URL, the second is the name
# if given, otherwise an empty string.
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})(
            'Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    # Tell lxml the page is UTF-8 instead of letting it guess
    parser = etree.HTMLParser(encoding="utf-8")
    page = html.parse(url, parser)

    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write(title.text + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()