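How to read the content of txt files in different directories and rename other files accordingly

I have just started using Python 3 and have run into the following problem: how do I read the content of txt files in different directories and rename other files accordingly?

For my thesis I downloaded a good number of PDF files from different journals, but they are all named by their DOI rather than in the format "Author (Year) - Title". The files are stored in different directories according to journal name and volume, for example: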

/Journal 1/
    /Vol. 1/
        file1.pdf
        file1.txt
        file2.pdf
        file2.txt
        filen.pdf
        filen.txt
    /Vol. 2/
        file1.pdf
        file1.txt
/Journal 2/
    ...

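Since I don't know how to read PDF content with Python, I wrote a very short bash script that converts the PDFs into plain TXT files. Each pdf file and its txt file have the same name and differ only in the file extension.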
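Because each TXT file shares its base name with its PDF, the matching PDF can be derived from the TXT path alone. The following is a minimal sketch, assuming Python 3 and the standard `pathlib` module; the function name `find_pairs` and the choice of root directory are hypothetical and not part of the original question:

```python
from pathlib import Path


def find_pairs(root):
    """Yield (txt_path, pdf_path) for every TXT file below root that has a sibling PDF."""
    for txt_path in sorted(Path(root).rglob('*.txt')):
        pdf_path = txt_path.with_suffix('.pdf')  # same name, only the extension differs
        if pdf_path.is_file():
            yield txt_path, pdf_path


if __name__ == '__main__':
    # Walk the current directory, including all journal and volume subdirectories.
    for txt_path, pdf_path in find_pairs('.'):
        print(txt_path, '->', pdf_path)
```

TXT files without a matching PDF are skipped silently here; logging them instead may be more useful when checking the conversion step.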

I want to rename all the PDF files, and luckily the running text of each file contains a string I can use. This variable string sits between two static strings:

"Cite this article as: " AUTHOR/YEAR/TITLE ", Journal name".
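One way to pull out the variable part is a regular expression with a non-greedy capture group between the two escaped markers. This is only a sketch, assuming the markers appear exactly as "Cite this article as: " and ", Journal name"; the helper name `extract_between` and the sample text are hypothetical:

```python
import re


def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None if absent."""
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # DOTALL lets the captured span cross line breaks in the extracted text.
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


if __name__ == '__main__':
    sample = 'Intro. Cite this article as: Doe (2015) Some Title, Journal name 3(1).'
    print(extract_between(sample, 'Cite this article as: ', ', Journal name'))
```

The function returns `None` when either marker is missing, so the caller can skip files that contain no citation line instead of failing.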

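How do I make Python go into each directory, read the TXT/PDF content, extract the variable string between the two fixed strings, and then rename the corresponding PDF file?

If anyone knows how to do this with Python 3, I would be very grateful.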

Source

2015-07-20 Telefonmann

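This is rather broad, really. There are a lot of steps involved. At which point exactly are you stuck? – usr2564301

If you open the PDF files in Acrobat and look under File/Properties, do the metadata strings contain this information? –

No, it is not in the metadata strings. I am stuck at looping over the directories and all files, and then renaming the files. To find the string I use: import re; s = 'blablablablaAUTHORblablabla'; result = re.search('blablablabla(.*)blablablabla', s) – Telefonmann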

Finally got it working:

# __author__ = 'Telefonmann'
# -*- coding: utf-8 -*-

import os
import re
import shutil

# finds the substring between "To cite this article: " and "Journal of Quantitative Linguistics,"
# the \s* between words also matches typos like "Journal ofQuantitative ..."
CITATION = re.compile(
    r'To\s*cite\s*this\s*article:\s*(.*?),\s*Journal\s*of\s*Quantitative\s*Linguistics\s*,')

# forbidden characters in Windows file names and what replaces them
FORBIDDEN = {'<': '', '>': '', ':': ' -', '"': '', '/': '', '\\': '', '|': '', '?': '', '*': ''}

for root, dirs, files in os.walk(os.getcwd()):  # loops through directories and files
    for file in files:
        if not file.endswith('.txt'):  # only processes txt files
            continue
        full_path = os.path.join(root, file)

        with open(full_path, 'r', encoding='utf-8') as f:
            content = f.read().replace('\n', '')  # remove newlines

        m = CITATION.search(content)
        if not m:
            # without a match there is no title, so skip this file
            print('No citation found in: ' + full_path)
            continue

        full_title = m.group(1)
        for char, replacement in FORBIDDEN.items():
            full_title = full_title.replace(char, replacement)

        # txt and pdf files only differ in their extension, so swap only the extension
        # (replacing 'txt' anywhere in the path could also alter directory names)
        pdf_name = os.path.splitext(full_path)[0] + '.pdf'
        new_path = os.path.join(root, '{0}.pdf'.format(full_title))

        # for troubleshooting
        print('File: ' + file)
        print('Full Path: ' + full_path)
        print('Full Title: ' + full_title)
        print('PDF Name: ' + pdf_name)
        print('....................................')

        if os.path.exists(pdf_name):  # make sure the pdf to copy actually exists
            print("all paths found")
            shutil.copy(pdf_name, new_path)
            # makes a copy of the pdf file with the new name in the respective directory

Source

2015-07-24 23:12:24 Telefonmann
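The clean-up of forbidden characters in the answer can also be collected in a small reusable helper. The following is only a sketch and not part of the original answer; the name `safe_filename` is hypothetical, and it applies the same rules as the answer (a colon becomes " -", the other characters forbidden in Windows file names are dropped):

```python
# Translation table: ':' becomes ' -', the other forbidden characters are deleted.
_TABLE = str.maketrans({':': ' -', '<': None, '>': None, '"': None, '/': None,
                        '\\': None, '|': None, '?': None, '*': None})


def safe_filename(title):
    """Return title with characters forbidden in Windows file names replaced or removed."""
    return title.translate(_TABLE).strip()


if __name__ == '__main__':
    print(safe_filename('Doe (2015): What is "Zipf\'s law"?'))
```

Using `str.translate` handles all characters in a single pass, which keeps the rule set in one place if more characters need to be added later.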
