Download, decompress and read a gzip file in Python

I want to download, extract and iterate over a text file in Python without creating temporary files.

Basically, I want this pipeline, but in Python:

curl ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz | gunzip | processing step

Here is my code:

def main():
    import urllib
    import gzip

    # Download SEED database
    print 'Downloading SEED Database'
    handle = urllib.urlopen('ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz')

    with open('SEED.fasta.gz', 'wb') as out:
        while True:
            data = handle.read(1024)
            if len(data) == 0: break
            out.write(data)

    # Extract SEED database
    handle = gzip.open('SEED.fasta.gz')
    with open('SEED.fasta', 'w') as out:
        for line in handle:
            out.write(line)

    # Filter SEED database
    pass

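I'd prefer not to use subprocess.Popen() or anything like it, because I want this script to be platform-independent.

The problem is that the gzip library only accepts filenames as arguments, not handles. The reason for "piping" is that the download step only uses about 5% of the CPU, so it would be faster to run the extraction and processing at the same time.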

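EDIT: This won't work because

"Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the 'file' is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream." - Dive Into Python

Which is why I get the error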

AttributeError: addinfourl instance has no attribute 'tell'

So, how does curl url | gunzip | whatever work?

2010-08-23 Austin Richardson

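Why not do it in separate Python files? 'python download.py | python extract.py | python filter.py'? – 2010-08-23 14:33:50

Because running Python scripts through system commands from within a Python script is cumbersome. Also, I said I want this to be platform-independent (meaning people on Windows shouldn't have any problems), and executing system commands would make that difficult. Does DOS even support pipes? – 2010-08-23 15:36:06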

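Just use gzip.GzipFile(fileobj=handle) and you'll be on your way. In other words, it's not really true that "the gzip library only accepts filenames as arguments, not handles"; you just have to use the fileobj= named argument.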
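As a rough illustration of the fileobj= suggestion, here is a minimal Python 3 sketch. The helper name iter_gzip_lines and the in-memory sample data are made up for this example and stand in for the real download handle; only gzip.GzipFile(fileobj=...) comes from the answer itself.

```python
import gzip
import io


def iter_gzip_lines(fileobj, encoding="utf-8"):
    """Yield decoded text lines from an already-open gzip-compressed stream.

    iter_gzip_lines is a hypothetical helper name used only in this sketch.
    """
    # fileobj= makes GzipFile read from the open binary stream, not from a path.
    binary = gzip.GzipFile(fileobj=fileobj, mode="rb")
    # TextIOWrapper decodes the decompressed bytes and splits them into lines.
    with io.TextIOWrapper(binary, encoding=encoding) as text:
        for line in text:
            yield line.rstrip("\n")


# Stand-in for the remote handle: a small in-memory gzip stream (made-up data).
sample = io.BytesIO(gzip.compress(b">seq1\nACGT\n>seq2\nTTGA\n"))
lines = list(iter_gzip_lines(sample))
```

In Python 3 the response object returned by urllib.request.urlopen(url) could presumably be passed in place of sample. The gzip documentation notes that support for unseekable files was added in Python 3.2, so this may not apply to the Python 2 code in the question.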

2010-08-23 14:41:21

Thanks! I didn't see that in the docs. – 2010-08-23 15:21:50

@Austin, you're welcome! – 2010-08-23 15:28:00

Keep in mind that the file object must support 'seek'. – Andrey 2015-02-05 17:54:43
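For a handle that cannot seek at all, one possible workaround (not something proposed in this thread) is to decompress chunk by chunk with zlib.decompressobj, which only ever moves forward, much like gunzip at the end of a pipe. The sketch below is written for Python 3; ForwardOnlyStream, read_in_chunks and iter_decompressed_chunks are hypothetical names, and the payload is made-up test data.

```python
import gzip
import zlib


class ForwardOnlyStream:
    """Hypothetical stand-in for a network handle: offers read() only."""

    def __init__(self, payload):
        self._payload = payload
        self._pos = 0

    def read(self, size):
        block = self._payload[self._pos:self._pos + size]
        self._pos += len(block)
        return block


def read_in_chunks(handle, size=1024):
    """Yield blocks from a handle using read() alone (no seek() or tell())."""
    while True:
        block = handle.read(size)
        if not block:
            break
        yield block


def iter_decompressed_chunks(chunks):
    """Decompress an iterable of gzip-compressed byte chunks incrementally."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail


# Made-up data: compress a small FASTA-like payload and read it back in pieces.
original = b">seq1\nACGT\n" * 500
stream = ForwardOnlyStream(gzip.compress(original))
result = b"".join(iter_decompressed_chunks(read_in_chunks(stream, size=64)))
```

This sketch handles a single gzip member only; a file made of several concatenated gzip members would need a fresh decompressor for each member.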
