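Handling single-file extraction from a corrupted GZ (TAR) archive

This is my first post on Stack Overflow, and I have a question about extracting a single file from a TAR archive compressed with GZ. I'm not the best at Python, so I may be going about this the wrong way; any help would be much appreciated.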

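Scenario:

A corrupted *.tar.gz file comes in. The first file in the GZ contains important information for obtaining the system's serial number (SN). This can be used to identify the machine so that we can notify its administrator that the file was corrupted.

The problem:

Using the regular UNIX tar binary, I can extract just the README file from the archive, even though the archive is incomplete and a full extraction returns an error. In Python, however, I cannot extract just one file; it raises an exception even when I specify only that single file.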

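Current workaround:

I'm using "os.popen" to call the UNIX tar binary in order to get just the README file.

Desired solution:

Use the Python tarfile package to extract just the single file.

Example errors:

UNIX (works):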

[[email protected] tmp]# tar -xvzf bundle.tar.gz README 
README 

gzip: stdin: unexpected end of file 
tar: Unexpected EOF in archive 
tar: Error is not recoverable: exiting now 
[[email protected] tmp]# 
[[email protected] tmp]# ls 
bundle.tar.gz README

Python:

>>> import tarfile 
>>> tar = tarfile.open("bundle.tar.gz") 
>>> data = tar.extractfile("README").read() 
Traceback (most recent call last): 
    File "<stdin>", line 1, in ? 
    File "/usr/lib64/python2.4/tarfile.py", line 1364, in extractfile 
    tarinfo = self.getmember(member) 
    File "/usr/lib64/python2.4/tarfile.py", line 1048, in getmember 
    tarinfo = self._getmember(name) 
    File "/usr/lib64/python2.4/tarfile.py", line 1762, in _getmember 
    members = self.getmembers() 
    File "/usr/lib64/python2.4/tarfile.py", line 1059, in getmembers 
    self._load()  # all members, we first have to 
    File "/usr/lib64/python2.4/tarfile.py", line 1778, in _load 
    tarinfo = self.next() 
    File "/usr/lib64/python2.4/tarfile.py", line 1588, in next 
    self.fileobj.seek(self.offset) 
    File "/usr/lib64/python2.4/gzip.py", line 377, in seek 
    self.read(1024) 
    File "/usr/lib64/python2.4/gzip.py", line 225, in read 
    self._read(readsize) 
    File "/usr/lib64/python2.4/gzip.py", line 273, in _read 
    self._read_eof() 
    File "/usr/lib64/python2.4/gzip.py", line 309, in _read_eof 
    raise IOError, "CRC check failed" 
IOError: CRC check failed 
>>> print data 
Traceback (most recent call last): 
    File "<stdin>", line 1, in ? 
NameError: name 'data' is not defined

Python (handling the exception):

>>> tar = tarfile.open("bundle.tar.gz") 
>>> try: 
...  data = tar.extractfile("README").read() 
... except: 
...  pass 
... 
>>> print(data) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in ? 
NameError: name 'data' is not defined

Source

2010-12-03 Ricky

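Looking at the tarfile.py code, the extractfile call ends up calling getmember, which calls getmembers. getmembers scans the entire tar file, and gzip croaks when it hits the EOF/corruption. Try supplying an already-decompressed stream so that the CRC exception is never raised. – kevpie 2010-12-04 04:32:32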
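One way to sidestep the full getmembers scan described in the comment above is to iterate over the archive lazily and stop at the first match. This is only a sketch, assuming Python 3's tarfile module and that the wanted file sits ahead of the damage. It uses tarfile's streaming mode ("r|gz"), which reads the archive sequentially, and the helper name read_first_match is made up for illustration.

```python
import tarfile


def read_first_match(archive_path, member_name):
    """Return the contents of member_name, or None if it cannot be reached.

    The archive is opened in streaming mode ("r|gz") and iteration stops at
    the first match, so nothing after that member is ever read.
    """
    try:
        with tarfile.open(archive_path, mode="r|gz") as tar:
            for member in tar:
                if member.name == member_name and member.isfile():
                    return tar.extractfile(member).read()
    except (tarfile.ReadError, EOFError, OSError):
        # The damage sits before or inside the wanted member,
        # or the archive ended before the member was found.
        return None
    return None
```

Calling read_first_match("bundle.tar.gz", "README") should return the README bytes as long as the README lies entirely before the truncation point. Depending on how the archive was built, the member may be stored as "./README" instead.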

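With the manual Unix method, it looks like gzip decompresses the file up to the point where it breaks.

The Python gzip (or tar) module bails out as soon as it notices that the archive is corrupt because the CRC check fails.

Just an idea, but you could pre-process the damaged archives with gzip and recompress them to correct the CRC.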

gunzip < damaged.tar.gz | gzip > corrected.tar.gz

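This will give you a corrected.tar.gz that contains all the data up to the point where the archive was broken. You should now be able to use the Python tar/gzip libraries without getting the CRC exception.

Keep in mind that this command un-gzips and re-gzips the archive, which costs storage I/O and CPU time, so you shouldn't do it for every archive.

To be efficient, you should only run it when you get the IOError: CRC check failed exception.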
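If shelling out is not an option, the same recompression step can be approximated in Python. The following is a rough sketch rather than a drop-in replacement for the gunzip | gzip pipeline: it assumes Python 3, a single-member gzip stream, and a hypothetical helper name recompress_salvaged. It feeds the damaged file through zlib directly, so no CRC check gets in the way, and writes whatever it recovers to a fresh gzip file.

```python
import gzip
import zlib


def recompress_salvaged(src_path, dst_path, chunk_size=64 * 1024):
    """Decompress as much of src_path as possible and re-gzip it to dst_path.

    Returns the number of uncompressed bytes recovered.
    """
    # MAX_WBITS | 16 tells zlib to expect gzip framing (header and trailer).
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    recovered = 0
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # input exhausted, possibly before the gzip trailer
            try:
                data = decomp.decompress(chunk)
            except zlib.error:
                break  # corrupt bytes: keep what was recovered so far
            dst.write(data)
            recovered += len(data)
            if decomp.eof:
                break  # intact stream: the trailer was reached
    return recovered
```

The resulting file is a valid gzip stream, but the tar data inside it is still cut short, so it is safer to iterate over the members and stop at the README than to ask tarfile for a full member listing.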

Source

2010-12-03 23:12:36 SirMo

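You can do something like this: attempt to decompress the gzip file into a temporary in-memory buffer, then attempt to extract the magic file from that. In the example below, I'm pretty aggressive about trying to read the entire file. Depending on the block size of the gzip data, you could probably get away with reading at most 128-256k. My gut tells me that gzip works in blocks of at most 64k, but I make no promises.

This method does everything in memory, with no intermediate files or writes to disk, but it also keeps all of the decompressed data in memory, so... I'm not kidding about tuning this for your specific use case.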

#!/usr/bin/env python3

import gzip
import io
import tarfile

# Depending on how your tar file is constructed, you might need to specify
# './README' as your magic_file

magic_file = 'README'

f = gzip.open('corrupt', 'rb')

t = io.BytesIO()

try:
    while True:
        block = f.read(1024)
        if not block:
            # Clean end of stream: the archive was not truncated after all
            break
        t.write(block)
except Exception as e:
    # A truncated or damaged stream ends up here; t keeps what was read so far
    print(e)
    print('%d bytes decompressed' % t.tell())

t.seek(0)
tarball = tarfile.open(name=None, mode='r', fileobj=t)

try:
    magic_data = None
    # Iterate lazily instead of calling getmember(), which scans the whole
    # (possibly truncated) archive before returning anything
    for member in tarball:
        if member.name == magic_file:
            magic_data = tarball.extractfile(member).read()
            break

    # search magic data for serial number or print out the
    # file
    print(magic_data)
except Exception as e:
    print(e)
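The remark above about getting away with only the first 128-256k can be sketched as a capped variant of the same in-memory idea. This is a sketch under assumptions, not a tested recipe for every archive: it assumes Python 3 and a single-member gzip stream, and the helper name extract_from_prefix and its max_bytes default are invented for illustration. It decompresses with zlib directly and stops once the cap is reached.

```python
import io
import tarfile
import zlib


def extract_from_prefix(gz_path, member_name, max_bytes=256 * 1024):
    """Return member_name's bytes if the whole member lies within the first
    max_bytes of decompressed data, else None."""
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # expect gzip framing
    prefix = io.BytesIO()
    with open(gz_path, "rb") as src:
        while prefix.tell() < max_bytes:
            chunk = src.read(16 * 1024)
            if not chunk:
                break  # compressed input exhausted
            try:
                # max_length caps how much output this call may produce
                prefix.write(decomp.decompress(chunk, max_bytes - prefix.tell()))
            except zlib.error:
                break  # corrupt bytes: work with what was recovered
            if decomp.eof:
                break  # intact stream ended inside the cap
    prefix.seek(0)
    try:
        with tarfile.open(fileobj=prefix, mode="r:") as tar:
            for member in tar:
                if member.name == member_name and member.isfile():
                    data = tar.extractfile(member).read()
                    return data if len(data) == member.size else None
    except tarfile.ReadError:
        return None  # ran off the end of the prefix without a full match
    return None
```

Memory use is bounded by max_bytes rather than by the size of the archive, at the cost of missing any file that starts or ends beyond the cap.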

Source

2010-12-04 14:06:37 synthesizerpatel
