2011-05-20 203 views
1

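Extracting compressed files

The following code lets me extract .tgz files. However, it stops extracting after about two levels; there are further subfolders containing .tgz files that still need to be extracted. Also, once I extract a file, I have to move it manually to another path, otherwise it gets overwritten by the other .tgz files extracted to that location (all the .tgz files I work with have the same file structure/folder names once extracted). Any help is appreciated. Thanks!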

import os, sys, tarfile

def extract(tar_url, extract_path='.'):
    print tar_url
    tar = tarfile.open(tar_url, 'r')
    for item in tar:
        tar.extract(item, extract_path)
        if item.name.find(".tgz") != -1 or item.name.find(".tar") != -1:
            extract(item.name, "./" + item.name[:item.name.rfind('/')])

try:
    extract(sys.argv[1] + '.tgz')
    print 'Done.'
except:
    name = os.path.basename(sys.argv[0])
    print name[:name.rfind('.')], '<filename>'
+4

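The first thing that jumps out is that you call extract recursively without closing the tar file you have open, so you could end up with too many open files. I would rewrite it to use a list as a stack: push the tar files you find onto it, close each tar file, then pop the next one off the stack and process it. – 2011-05-20 20:32:13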

+0

The second thing is that you are passing the wrong 'extract_path'. Use 'os.path.join(extract_path, item.name ....)'. – khachik 2011-05-20 20:36:41

+2

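The third thing is that you are using a bare 'except', so even if an exception is raised to tell you that something went wrong, it never gets a chance to be reported. Use try ... except and be specific about which exception you are catching. – MRAB 2011-05-20 22:50:32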
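The three comments above can be sketched together as follows. This is only an illustration under assumptions, not the asker's final code: it is written for Python 3, the name 'extract_nested' and the 'ARCHIVE_SUFFIXES' tuple are made up for the example, and unpacking each nested archive into a folder named after it is one possible way to avoid the overwriting problem.

```python
import os
import tarfile

# Assumed set of nested archive types (hypothetical name).
ARCHIVE_SUFFIXES = ('.tgz', '.tar')


def extract_nested(tar_path, extract_path='.'):
    """Extract tar_path and every archive nested inside it, without recursion."""
    pending = [(tar_path, extract_path)]  # a list used as a stack
    while pending:
        archive, target = pending.pop()
        try:
            # The with-block closes this archive before the next one is opened.
            with tarfile.open(archive, 'r') as tar:
                names = [m.name for m in tar.getmembers() if m.isfile()]
                tar.extractall(target)
        except (tarfile.TarError, OSError) as exc:
            # Catch specific exceptions and report them instead of hiding them.
            print('Could not extract %s: %s' % (archive, exc))
            continue
        for name in names:
            if name.endswith(ARCHIVE_SUFFIXES):
                # A nested archive lives under the folder it was extracted to.
                nested = os.path.join(target, name)
                # Assumed choice: unpack it into a sibling folder named after it.
                pending.append((nested, os.path.splitext(nested)[0]))
```

Calling 'extract_nested(sys.argv[1] + ".tgz")' would replace the call in the question; because each archive gets its own folder, archives with identical internal layouts should no longer overwrite one another.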

Answers

3

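If I have not misunderstood your question, this is what you want to do:

  • Extract .tgz files that may contain more .tgz files inside them, which in turn need to be extracted (and so on).
  • While extracting, take care not to replace directories that already exist in the folder.

If that is right, here is what my code does:

  • It extracts each .tgz file (recursively) into a separate folder with the same name as the .tgz file (without the extension), placed in the same directory.
  • While extracting, it makes sure that it does not overwrite/replace any file/folder that already exists.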

So, if this is the directory structure of the .tgz file:

parent/
    xyz.tgz/
        a
        b
        c
        d.tgz/
            x
            y
            z
        a.tgz/    # note: if I extract this directly, it will replace/overwrite the contents of the folder 'a'
            m
            n
            o
            p

After extraction, the directory structure will be:

parent/
    xyz.tgz
    xyz/
        a
        b
        c
        d/
            x
            y
            z
        a 1/    # 'a.tgz' is extracted to the folder 'a 1' because a folder 'a' already exists in the same folder.
            m
            n
            o
            p

Although I have documented my code extensively below, I will briefly describe the structure of my program. Here are the functions I have defined:

FileExtension --> returns the extension of a file
AppropriateFolderName --> helps prevent overwriting/replacing already existing folders (how? you will see in the program)
Extract --> extracts a .tgz file (safely)
WalkTreeAndExtract --> walks down a directory (passed as a parameter) and extracts all .tgz files (recursively) on the way down.

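I cannot suggest changes to your code, because my approach is a bit different. I have used the extractall method of the tarfile module rather than the more complicated route through the extract method that you took. (Just have a look at http://docs.python.org/library/tarfile.html#tarfile.TarFile.extractall and read the warning that comes with the extractall method. I don't think we will generally run into that problem, but keep it in mind.)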
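For readers who do want to act on that warning (it concerns archives whose members have absolute paths or '..' components and would therefore land outside the target folder), here is a minimal sketch of a pre-check. It is not part of the answer's program: it assumes Python 3, the names 'is_within' and 'safe_extractall' are hypothetical, and refusing every link member is a deliberately conservative choice.

```python
import os
import tarfile


def is_within(directory, target):
    """Return True if target resolves to a location inside directory."""
    directory = os.path.realpath(directory)
    target = os.path.realpath(target)
    return os.path.commonpath([directory, target]) == directory


def safe_extractall(tar_path, dest):
    """Extract tar_path into dest, refusing members that would escape dest."""
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            # Reject absolute paths and '..' tricks that point outside dest.
            if not is_within(dest, os.path.join(dest, member.name)):
                raise ValueError('Unsafe member path: %s' % member.name)
            # Conservative assumption: do not extract symlinks or hard links.
            if member.issym() or member.islnk():
                raise ValueError('Refusing link member: %s' % member.name)
        tar.extractall(dest)
```

Newer Python releases also accept a 'filter' argument to extractall for the same purpose, which may be preferable where it is available.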

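So here is the code that worked for me:
(I tried it on .tar files nested 5 levels deep (i.e. a .tar inside a .tar inside a .tar ... 5 times), but it should work for any depth*, and for .tgz files as well.)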

# extracting_nested_tars.py 

import os 
import re 
import tarfile 

file_extensions = ('tar', 'tgz') 
# Edit this according to the archive types you want to extract. Keep in 
# mind that these should be extractable by the tarfile module. 

def FileExtension(file_name):
    """Return the file extension of file_name.

    file_name should be a string. It can be either the full path of
    the file or just its name (or any string, as long as it contains
    the file extension).

    Example:
    input (file_name) --> 'abc.tar'
    return value      --> 'tar'

    """
    # The greedy '.*' makes the match land on the last dot in the string.
    match = re.match(r"^.*[.](?P<ext>\w+)$", file_name)

    if match:
        return match.group('ext')
    return ''  # file_name has no file extension

def AppropriateFolderName(folder_name, parent_fullpath):
    """Return a folder name such that it can be safely created in
    parent_fullpath without replacing any existing folder in it.

    Check if a folder named folder_name exists in parent_fullpath. If not,
    return folder_name unchanged (it can be safely created without
    replacing any existing folder). If it does, append an appropriate
    number to folder_name such that the new folder name can be safely
    created in the folder parent_fullpath.

    Examples:
    folder_name = 'untitled folder'
    return value = 'untitled folder' (if no such folder already exists
                                      in parent_fullpath.)

    folder_name = 'untitled folder'
    return value = 'untitled folder 1' (if a folder named 'untitled folder'
                                        already exists but no folder named
                                        'untitled folder 1' exists in
                                        parent_fullpath.)

    folder_name = 'untitled folder'
    return value = 'untitled folder 2' (if folders named 'untitled folder'
                                        and 'untitled folder 1' both
                                        already exist but no folder named
                                        'untitled folder 2' exists in
                                        parent_fullpath.)

    """
    if os.path.exists(os.path.join(parent_fullpath, folder_name)):
        match = re.match(r'^(?P<name>.*)[ ](?P<num>\d+)$', folder_name)
        if match:
            # folder_name already ends in a number: increment it.
            name = match.group('name')
            number = match.group('num')
            new_folder_name = '%s %d' % (name, int(number) + 1)
        else:
            # No trailing number yet: start counting at 1.
            new_folder_name = '%s 1' % folder_name
        # Recursively call itself to check whether a folder named
        # new_folder_name already exists in parent_fullpath or not.
        return AppropriateFolderName(new_folder_name, parent_fullpath)
    else:
        return folder_name

def Extract(tarfile_fullpath, delete_tar_file=True):
    """Extract tarfile_fullpath to an appropriate folder (see
    AppropriateFolderName) of the same name as the tar file (without
    an extension) and return the name of this folder.

    If delete_tar_file is True, it will delete the tar file after
    its extraction; if False, it won't. The default value is True, as
    you would normally want to delete the (nested) tar files after
    extraction. Pass False if you don't want to delete the tar file
    you are passing (after its extraction).

    Return None if the extraction fails.

    """
    tarfile_name = os.path.basename(tarfile_fullpath)
    parent_dir = os.path.dirname(tarfile_fullpath)

    # Remove the extension (e.g. '.tar') from the file name; leave the
    # name untouched if it has no extension.
    extension = FileExtension(tarfile_name)
    if extension:
        base_name = tarfile_name[:-len(extension) - 1]
    else:
        base_name = tarfile_name
    # Get a folder name (from the function AppropriateFolderName)
    # in which the contents of the tar file can be extracted,
    # so that it doesn't replace an already existing folder.
    extract_folder_name = AppropriateFolderName(base_name, parent_dir)
    # The full path to this new folder.
    extract_folder_fullpath = os.path.join(parent_dir, extract_folder_name)

    try:
        tar = tarfile.open(tarfile_fullpath)
        try:
            tar.extractall(extract_folder_fullpath)
        finally:
            tar.close()  # close the archive even if extraction fails
        if delete_tar_file:
            os.remove(tarfile_fullpath)
        return extract_folder_name
    except Exception as e:
        # Exceptions can occur while opening a damaged tar file.
        print('Error occurred while extracting %s\n'
              'Reason: %s' % (tarfile_fullpath, e))
        return None

def WalkTreeAndExtract(parent_dir):
    """Recursively descend the directory tree rooted at parent_dir
    and extract each tar file on the way down (recursively).
    """
    try:
        dir_contents = os.listdir(parent_dir)
    except OSError as e:
        # This can happen if the program lacks permission to open
        # the folder.
        print('Error occurred. Could not open folder %s\n'
              'Reason: %s' % (parent_dir, e))
        return

    for content in dir_contents:
        content_fullpath = os.path.join(parent_dir, content)
        if os.path.isdir(content_fullpath):
            # If content is a folder, walk it down completely.
            WalkTreeAndExtract(content_fullpath)
        elif os.path.isfile(content_fullpath):
            # If content is a file, check if it is a tar file.
            # If so, extract its contents to a new folder.
            if FileExtension(content_fullpath) in file_extensions:
                extract_folder_name = Extract(content_fullpath)
                if extract_folder_name:  # None means extraction failed
                    # Append the newly extracted folder to dir_contents
                    # so that this same loop later searches it for more
                    # tar files to extract. (Items appended to a list
                    # during iteration are still visited by the loop.)
                    dir_contents.append(extract_folder_name)
        else:
            # Unknown file type.
            print('Skipping %s. <Neither file nor folder>' % content_fullpath)

if __name__ == '__main__':
    tarfile_fullpath = 'fullpath_path_of_your_tarfile'  # pass the path of your tar file here.
    extract_folder_name = Extract(tarfile_fullpath, False)

    if extract_folder_name:  # None means the extraction failed
        # tarfile_fullpath is extracted to extract_folder_name. Now descend
        # down its directory structure and extract all other tar files
        # (recursively).
        extract_folder_fullpath = os.path.join(
            os.path.dirname(tarfile_fullpath), extract_folder_name)
        WalkTreeAndExtract(extract_folder_fullpath)
    # If you want to extract all tar files in a dir, just call
    # WalkTreeAndExtract on that dir and nothing else.

I have not added a command-line interface. I guess you can add one if you find it useful.
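One way such a command-line interface might look is sketched below. Everything here is an assumption rather than part of the answer: the option name '--walk-dir' and the functions 'build_parser' and 'main' are invented for the example, and the import assumes the program above has been saved as extracting_nested_tars.py (the name in its header comment).

```python
import argparse
import os


def build_parser():
    """Build the argument parser (option names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description='Extract a tar archive and every archive nested in it.')
    parser.add_argument(
        'path', help='tar file to extract (or a directory with --walk-dir)')
    parser.add_argument(
        '--walk-dir', action='store_true',
        help='treat PATH as a directory and extract all tar files under it')
    return parser


def main(argv=None):
    """Parse argv and run the extraction; return a process exit code."""
    args = build_parser().parse_args(argv)
    # Assumption: the program above is saved as extracting_nested_tars.py.
    from extracting_nested_tars import Extract, WalkTreeAndExtract

    if args.walk_dir:
        WalkTreeAndExtract(args.path)
        return 0
    folder_name = Extract(args.path, False)  # keep the top-level archive
    if not folder_name:
        return 1  # extraction failed; Extract already printed the reason
    WalkTreeAndExtract(os.path.join(os.path.dirname(args.path), folder_name))
    return 0
```

To use it as a script, call 'main()' from an 'if __name__ == "__main__":' block and pass its return value to 'sys.exit'.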

Here is a slightly better version of the above program:
http://guanidene.blogspot.com/2011/06/nested-tar-archives-extractor.html

+0

Instead of 'os.listdir()', use 'os.walk()' to traverse the directory tree. Re: "(the slicing is to remove the extension (.tar) from the file name.)" Use 'os.path.splitext(tarfile_name)[0]' – hughdbrown 2011-06-25 15:00:45

+0

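I can't do this: 'dir_contents.append(extract_folder_name)' (and customize it further) if I use 'os.walk()'. As for 'os.path.splitext', it is correct for '.tgz' files, but I have written this for a more general purpose: extracting '.tar.gz' and '.tar.bz2' files (for which it wrongly gives the extension as '.gz' and '.bz2'). – 2011-06-25 15:08:22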
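The extension point in this exchange is easy to check: 'os.path.splitext' only splits off the last suffix, so 'backup.tar.gz' becomes 'backup.tar' plus '.gz'. A small sketch of one way to handle multi-part suffixes, combined with 'os.walk()' for finding archives, is shown below. The names 'ARCHIVE_SUFFIXES', 'strip_archive_suffix' and 'find_archives' are hypothetical and not taken from the answer or the linked post.

```python
import os

# Assumed list of archive suffixes to recognise, matched as whole suffixes.
ARCHIVE_SUFFIXES = ('.tar.gz', '.tar.bz2', '.tgz', '.tar')


def strip_archive_suffix(file_name):
    """Return file_name without its archive suffix, or None if it has none."""
    lowered = file_name.lower()
    for suffix in ARCHIVE_SUFFIXES:
        if lowered.endswith(suffix) and len(file_name) > len(suffix):
            return file_name[:-len(suffix)]
    return None


def find_archives(root):
    """Yield the full path of every archive found under root via os.walk."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if strip_archive_suffix(name) is not None:
                yield os.path.join(dirpath, name)
```

Because a walk may not descend into folders created after their parent directory was listed, one option is to repeat the scan until 'find_archives' yields nothing, which sidesteps the need to append to the list being iterated.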

+0

This is exactly what I was trying to do... Thanks! – suffa 2011-06-28 01:31:33