Scrapy: converting from unicode to utf-8

I've written a simple script to extract data from a site. The script works as expected, but I'm not happy with the output format.
Here is my code:

import urlparse  # Python 2; on Python 3 use urllib.parse instead

from scrapy import Request, Spider
from scrapy.loader import ItemLoader

# Article is the project's Item subclass (defined elsewhere in the project).


class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["example.com"]
    start_urls = (
        "http://example.com/tag/1/page/1",  # trailing comma makes this a tuple
    )

    def parse(self, response):
        # Follow the "next page" link
        next_selector = response.xpath('//a[@class="next"]/@href')
        url = next_selector[1].extract()
        # url is like "tag/1/page/2"
        yield Request(urlparse.urljoin("http://example.com", url))

        # Queue every article linked from this listing page
        item_selector = response.xpath('//h3/a/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin("http://example.com", url),
                          callback=self.parse_article)

    def parse_article(self, response):
        item = ItemLoader(item=Article(), response=response)
        # Extract the title of every article
        item.add_xpath('title', '//h1[@class="title"]/text()')
        return item.load_item()

The output I'm not happy with looks like this:

[scrapy] DEBUG: Scraped from <...> {'title': [u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"']}

I think I need to use a custom ItemLoader class, but I don't know how. I need your help.

TL;DR: I need to convert the text scraped by Scrapy from unicode to utf-8.


2016-04-29 GriMel

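This is just Scrapy printing Unicode characters (Cyrillic). How are you going to save your scraped items? What will you do with them once they are saved? Unicode problems usually depend on which software you use to view the unicode data. – Steve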

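Later I'll save it to a PostgreSQL database (using a pipeline), but for now I run it as 'scrapy crawl article -o file.json' and I see the same output in the JSON file. I have to admit I'm new to Scrapy, so I'd appreciate any criticism :) – GriMel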

Related: [Python string prints as [u'String']](http://stackoverflow.com/a/36891685/4279) – jfs

As you can see below, this isn't so much a Scrapy issue as a Python one. It could only marginally be called an issue at all :)

$ scrapy shell http://censor.net.ua/resonance/267150/voobscheto_svoboda_zakanchivaetsya 

In [7]: print response.xpath('//h1/text()').extract_first() 
 "ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ" 

In [8]: response.xpath('//h1/text()').extract_first() 
Out[8]: u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"'

What you see are two different representations of the same thing: a unicode string.
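To make the difference concrete, here is a minimal sketch in Python 3 (an assumption; the session above is Python 2, where the repr of a unicode string escapes non-ASCII characters). `ascii()` reproduces that escaped form, while `print()` shows the characters themselves. The shortened title is only illustrative.

```python
# One Unicode string, two ways of displaying it.
title = '\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e"'

escaped = ascii(title)   # escaped form, like Python 2's repr() of a unicode string
readable = str(title)    # the characters themselves

print(escaped)
print(readable)

# Encoding to UTF-8 only matters when the text leaves the program as bytes,
# and it round-trips without loss.
utf8_bytes = title.encode('utf-8')
assert utf8_bytes.decode('utf-8') == title
```

Nothing is "converted" between the two printed lines; only the display differs.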

I'd suggest running the crawl with -L INFO, or adding LOG_LEVEL='INFO' to your settings.py, so that this output isn't shown in the console.

One annoying thing is that when you save as JSON, you get escaped unicode JSON, e.g.

$ scrapy crawl example -L INFO -o a.jl

gives you:

$ cat a.jl 
{"title": "\u00a0\"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f\""}

This is correct, but it takes more space, and most applications handle non-escaped JSON equally well.

Adding a few lines to your settings.py can change this behaviour:

from scrapy.exporters import JsonLinesItemExporter


class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)


FEED_EXPORTERS = {
    'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
    'jl': 'myproject.settings.MyJsonLinesItemExporter',
}

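Essentially, all we do is set ensure_ascii=False for the default JSON item exporters. This prevents escaping. I wish there were an easier way to pass arguments to exporters, but I can't see one, since they are initialized with their default arguments. Anyway, now your JSON file has: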

$ cat a.jl 
{"title": " \"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ\""}

This looks better, is equally valid, and is more compact.
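The effect of `ensure_ascii` can be checked without Scrapy. The sketch below uses only the standard `json` module under Python 3 (an assumption; the thread itself is Python 2) and a hand-written `item` dict standing in for a scraped item. It shows that both forms decode to the same data and that the non-escaped form is smaller when written as UTF-8.

```python
import json

# Hand-written stand-in for a scraped item.
item = {'title': '\xa0"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ"'}

escaped = json.dumps(item)                        # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)   # keeps non-ASCII characters

# Both lines decode to the same data.
assert json.loads(escaped) == json.loads(readable) == item

# Size on disk when each line is written as UTF-8.
escaped_size = len(escaped.encode('utf-8'))
readable_size = len(readable.encode('utf-8'))
print(escaped_size, readable_size)
```

Later Scrapy releases (1.2 and newer) also provide a `FEED_EXPORT_ENCODING` setting. Setting it to `'utf-8'` in settings.py turns off the `\uXXXX` escapes in JSON feeds without a custom exporter.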


2016-04-30 19:22:39 neverlastn

There are 2 independent issues that affect how unicode strings are displayed.

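If you return a list of strings, the output file will have problems with them, because the list elements are serialized with the ASCII codec by default. You can work around it as shown below, but it is more appropriate to use extract_first(), as suggested by @neverlastn.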
```
from scrapy import Field, Item

class Article(Item):
    title = Field(serializer=lambda x: u', '.join(x))
```
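As a rough illustration of what that serializer does, and of the `extract_first()` alternative, here is a plain-Python sketch that needs no Scrapy install. The `extracted` list and the `first_or_none` helper are hypothetical stand-ins for what `extract()` returns and for `extract_first()`; they are not part of Scrapy.

```python
# Stand-in for the list that extract() returns for a selector.
extracted = ['\xa0"ВООБЩЕ-ТО', 'СВОБОДА"']

# What the Field serializer above does: collapse the list into one string.
joined = ', '.join(extracted)


def first_or_none(values):
    """Hypothetical helper mimicking extract_first(): first match or None."""
    return values[0] if values else None


print(joined)
print(first_or_none(extracted))
```

Either way the item field ends up holding a single string rather than a list, so the exporter no longer has to serialize a list of unicode values.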

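The default implementation of the __repr__() method serializes unicode strings to their escaped \uxxxx form. You can override this method in your Item class to change this behaviour: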

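class Article(Item):
    def __repr__(self):
        # Python 2: encode unicode values to UTF-8 byte strings
        data = dict(self)
        for k in data.keys():
            if isinstance(data[k], unicode):
                data[k] = data[k].encode('utf-8')
        # Format by hand; repr()/pformat() would escape the non-ASCII bytes again
        return '{%s}' % ', '.join('%r: %s' % (k, v) for k, v in data.items())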


2016-05-01 06:31:02
