2014-11-08

I'm writing a scraper for http://www.delfi.lt (using lxml and py3k on Windows 8). The goal is to write certain information out to a .txt file. Since the site is in Lithuanian, ASCII obviously won't do as the encoding, so I'm trying to write everything out as UTF-8. However, not all non-ASCII characters make it into the file correctly.

For example, I get DELFI Å½inios > Dienos naujienos > UÅ¾sienyje instead of DELFI Žinios > Dienos naujienos > Užsienyje.
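That kind of garbling is classic mojibake: UTF-8 bytes being decoded as Latin-1 somewhere along the way. A minimal standard-library sketch of the mechanism (and its reversal):

```python
# "Ž" is two bytes in UTF-8 (0xC5 0xBD); read back as Latin-1 they
# become the two characters "Å" and "½" — exactly the garbling above.
correct = "DELFI Žinios"
mojibake = correct.encode("utf-8").decode("latin-1")
print(mojibake)    # DELFI Å½inios

# The damage is reversible as long as no bytes were lost:
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)    # DELFI Žinios
```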

This is as far as I've gotten with the scraper:

from lxml import html
import sys

# Takes in command line input, namely the URL of the story and (optionally)
# the name of the CSV file that will store all of the data.
# Outputs a list of two strings: the first is the URL, the second is the name
# if given, otherwise an empty string.
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})(
            'Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    page = html.parse(url)

    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write(title.text + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()
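For what it's worth, the find/findall calls above can be exercised offline against an inline document (a sketch; the headline text and category names here are made up):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <h1 itemprop="headline">Pavyzdys</h1>
  <span itemprop="title">DELFI Žinios</span>
  <span itemprop="title">Užsienyje</span>
</body></html>""")

# fromstring returns an element, so the paths are relative (".//" not "//")
title = doc.find('.//h1[@itemprop="headline"]')
category = doc.findall('.//span[@itemprop="title"]')
print(title.text)                              # Pavyzdys
print(' > '.join(x.text for x in category))    # DELFI Žinios > Užsienyje
```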

An example run: python scraper.py http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799 produces a file named output.txt containing:

Ukraina: separatistai siautÄja, O. TurÄynovas atnaujina mobilizacijÄ 
DELFI Å½inios > Dienos naujienos > UÅ¾sienyje 

instead of

Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizaciją 
DELFI Žinios > Dienos naujienos > Užsienyje 

How do I get the script to output all of the text correctly?

Answer


Using requests and BeautifulSoup, and letting requests fetch the raw bytes via .content, works for me (r.text would decode using the encoding guessed from the HTTP headers, which can be wrong; r.content hands BeautifulSoup the raw bytes so it can detect the charset itself):

import requests
from bs4 import BeautifulSoup

def main():
    url, name = "http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799", "foo.csv"
    r = requests.get(url)

    # r.content is the raw bytes; BeautifulSoup works out the encoding itself
    page = BeautifulSoup(r.content, "html.parser")

    title = page.find("h1", {"itemprop": "headline"})
    category = page.find_all("span", {"itemprop": "title"})
    print(title)
    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write(title.text + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

Output:

Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizacijąnaujausi susirėmimų vaizdo įrašai 
DELFI Žinios > Dienos naujienos > Užsienyje 

Changing the parser encoding also works:

parser = etree.HTMLParser(encoding="utf-8")
page = html.parse(url, parser)
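The effect of the explicit encoding can be checked offline by feeding lxml raw UTF-8 bytes directly (a sketch with a made-up snippet; no network involved):

```python
from lxml import etree, html

# Raw UTF-8 bytes, as they would arrive over the wire
raw = '<html><body><h1 itemprop="headline">Užsienyje</h1></body></html>'.encode("utf-8")

# With the encoding given explicitly, lxml decodes the bytes as UTF-8
# instead of guessing
parser = etree.HTMLParser(encoding="utf-8")
doc = html.fromstring(raw, parser=parser)
print(doc.findtext('.//h1'))    # Užsienyje
```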

So change your code to the following:

from lxml import html, etree
import sys

# Takes in command line input, namely the URL of the story and (optionally)
# the name of the CSV file that will store all of the data.
# Outputs a list of two strings: the first is the URL, the second is the name
# if given, otherwise an empty string.
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})(
            'Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    # Tell lxml the page is UTF-8 instead of letting it guess
    parser = etree.HTMLParser(encoding="utf-8")
    page = html.parse(url, parser)

    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write(title.text + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()