Scrapy: overwrite the JSON file instead of appending to it

Example:

scrapy crawl myspider -o "/path/to/json/my.json" -t json  
scrapy crawl myspider -o "/path/to/json/my.json" -t json

appends to the my.json file instead of overwriting it.

Source

2015-10-15 hooliooo

scrapy crawl myspider -t json --nolog -o - > "/path/to/json/my.json"

Source

2015-11-02 21:47:38 eLRuLL

Thanks! This is what I was looking for. So simply the "- >" part overwrites the file? – hooliooo

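-o - sends the output to standard output, and > redirects standard output to a new file at the path that follows. I used it and it worked oddly, in that I got invalid JSON output. – miguelfg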

Why doesn't this work when I call it with subprocess.check_output inside a Docker container? ''''，'>'，''''，''''，''''，'''''''''''''''''''''' ，'-a'，'url = url.jpg]' returned non-zero exit status 2 – 2017-05-18 09:27:37
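The > redirect in the answer above is a shell feature. That is most likely why passing '>' as a list element to subprocess.check_output fails: without a shell, the '>' reaches Scrapy as a literal argument. Below is a minimal sketch, assuming Python 3, that gets the same overwrite behaviour by opening the target file in 'wb' mode and handing it to the child process as its stdout. The helper name run_and_overwrite is hypothetical, and the demo command is a stand-in so that the sketch runs without a Scrapy project.

```python
import subprocess
import sys


def run_and_overwrite(cmd, out_path):
    """Run cmd and write its stdout to out_path, truncating old content.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # 'wb' truncates the file, which is what the shell's '>' does.
    with open(out_path, "wb") as out_file:
        subprocess.run(cmd, stdout=out_file, check=True)


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere.
    # For the crawl in this thread the list would presumably be:
    # ["scrapy", "crawl", "myspider", "-t", "json", "--nolog", "-o", "-"]
    demo_cmd = [sys.executable, "-c", "print('[]')"]
    run_and_overwrite(demo_cmd, "my.json")
```

Passing the arguments as a list avoids shell quoting problems, and check=True raises CalledProcessError if the command exits with a non-zero status.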

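This is an old, well-known "problem" with Scrapy. Every time you start a crawl and do not want to keep the results of previous runs, you have to delete the file yourself. The idea behind this is that you may want to crawl different sites, or the same site at different times, and appending stops you from accidentally losing results you have already collected, which could be bad.

A solution is to write your own item pipeline in which you open the target file with 'w' instead of 'a'.

To see how to write such a pipeline, look in the docs: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline (specifically for JSON exports: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file)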
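A minimal sketch of such a pipeline, loosely modelled on the JSON writer example in the linked docs and assuming dict-like items. The class name OverwriteJsonLinesPipeline and the default path items.jl are hypothetical. It writes JSON Lines (one object per line) rather than a single JSON array, to keep the example short.

```python
import json


class OverwriteJsonLinesPipeline:
    """Item pipeline that rewrites its output file on every crawl.

    Hypothetical class: the name and the default path are illustrative.
    """

    path = "items.jl"  # assumed output location

    def open_spider(self, spider):
        # 'w' truncates the file; 'a' would keep results from earlier runs.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line, so no closing bracket is needed.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

The pipeline would be enabled through the ITEM_PIPELINES setting, for example {'myproject.pipelines.OverwriteJsonLinesPipeline': 300}, where the module path is an assumption about the project layout.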

Source

2015-10-15 05:46:16 GHajba

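Could I do something similar with the exporter.py script? Would I instantiate a custom JsonItemExporter class with my edits (I'm still a novice programmer, so I don't know if what I'm saying is correct) and then add self.file = open(file, 'wb')? I'm not sure if this is the right way. – hooliooo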

Since the accepted answer gave me problems with invalid JSON, this might work:

find "/path/to/json/" -name "my.json" -exec rm {} \; && scrapy crawl myspider -t json -o "/path/to/json/my.json"
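The same delete-then-crawl idea can be sketched in Python, for cases where the crawl is started from a script rather than a shell. The helper name remove_stale_feed is hypothetical, and the crawl invocation is left as a comment because it needs a Scrapy project to run.

```python
from pathlib import Path


def remove_stale_feed(path):
    """Delete path if it exists; return True when a file was removed.

    Hypothetical helper, named for illustration only.
    """
    target = Path(path)
    if target.is_file():
        target.unlink()
        return True
    return False


if __name__ == "__main__":
    remove_stale_feed("/path/to/json/my.json")
    # The crawl would then be started afterwards, for example with
    # subprocess.run(["scrapy", "crawl", "myspider", "-t", "json",
    #                 "-o", "/path/to/json/my.json"], check=True)
```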

Source

2016-05-27 11:19:39 miguelfg

To solve this problem, I created a subclass of scrapy.extensions.feedexport.FileFeedStorage in my myproject directory.

This is my customexport.py:

"""Custom Feed Exports extension.""" 
import os

from scrapy.extensions.feedexport import FileFeedStorage


class CustomFileFeedStorage(FileFeedStorage):
    """
    A File Feed Storage extension that overwrites existing files.

    See: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/feedexport.py#L79
    """

    def open(self, spider):
        """Return the opened file."""
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        # changed from 'ab' to 'wb' to truncate file when it exists
        return open(self.path, 'wb')

Then I added the following to my settings.py (see: https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-storages-base):

FEED_STORAGES_BASE = { 
    '': 'myproject.customexport.CustomFileFeedStorage', 
    'file': 'myproject.customexport.CustomFileFeedStorage', 
}

Now, every time I write to a file, it gets overwritten because of this.
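For readers on a newer Scrapy than the 1.2 docs linked above: later releases (2.4 or newer, if I recall the release notes correctly) added an overwrite option to the FEEDS setting, along with a -O command-line flag that overwrites where -o appends. A minimal settings.py sketch under that assumption, reusing the path from the question:

```python
# settings.py fragment. Assumes a Scrapy version whose FEEDS setting
# supports the "overwrite" key; check the feed exports docs for your version.
FEEDS = {
    "/path/to/json/my.json": {
        "format": "json",
        "overwrite": True,  # truncate the file instead of appending
    },
}
```

With this in place, no custom storage class should be needed on those versions.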

Source

2016-11-26 21:06:03 robkorv

Nice solution. Is it a good idea to redefine FEED_STORAGES_BASE in settings.py? In that case the scrapy crawl command will still face this problem. – hAcKnRoCk

I would call it 'OverwriteFileFeedStorage'. – Suor
