Python: loop through the files in a folder and do a word count

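I'm new to Python and I need to write a script that counts all the words in all the txt files in a directory. This is what I have so far. It works when I just open a single txt file, but it fails when I pass in a directory. I know I need an append somewhere, and I've tried a few different ways, but had no luck.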

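*Edit: I would like the results combined. So far it gives two separate results. I tried making a new list and appending the counter to it, but it broke. Thanks again, this is a good community.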

import re 
import os 
import sys 
import os.path 
import fnmatch 
import collections 

def search(file):

    if os.path.isdir(path) == True:
        for root, dirs, files in os.walk(path):
            for file in files:
                words = re.findall('\w+', open(file).read().lower())
                ignore = ['the','a','if','in','it','of','or','on','and','to']
                counter=collections.Counter(x for x in words if x not in ignore)
                print(counter.most_common(10))

    else:
        words = re.findall('\w+', open(path).read().lower())
        ignore = ['the','a','if','in','it','of','or','on','and','to']
        counter=collections.Counter(x for x in words if x not in ignore)
        print(counter.most_common(10))

path = input("Enter file and path, place ' before and after the file path: ") 
search(path) 

raw_input("Press enter to close: ")

2012-01-31 Garrett

What do you mean by "it fails"? Apart from that, I can't see a '.txt' restriction anywhere. – eumiro 2012-01-31 15:28:44

'if os.path.isdir(path) == True' can be shortened to 'if os.path.isdir(path)' – unutbu 2012-01-31 15:31:28

Change line 14 to:

words = re.findall('\w+', open(os.path.join(root, file)).read().lower())

Also, if you replace the input line with

path = raw_input("Enter file and path")

then you will no longer need to put ' before and after the path.

2012-01-31 15:34:14

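Thanks a lot, I knew it was something minor. I had looked at this. Should I add another list, and just have counter = collections.Counter(x for x in words if x not in ignore) appended to the new list and then print it? – Garrett 2012-01-31 15:40:22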

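That depends on what you're trying to do. Do you just want to print how many times each word occurs in each file? Or do you want to find the most common words across _all_ files? – 2012-01-31 16:32:24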

atm it prints the 10 most common words for each file, separately. I want it to give me the 10 most common words across all files combined. ty – Garrett 2012-01-31 16:34:20
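One way to get a single combined result, as asked in the comment above, is to keep one Counter outside the loop and update it for every file instead of creating and printing a new one each time. This is a minimal sketch in Python 3. The function name combined_top_words is made up for illustration, and it takes plain strings rather than files so that it stays self-contained.

```python
import re
from collections import Counter

# Same stop words as in the question, stored as a set.
IGNORE = {'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'}

def combined_top_words(texts, n=10):
    """Return the n most common words across all texts, skipping IGNORE.

    Hypothetical helper: 'texts' stands in for the contents of each file.
    """
    total = Counter()  # one counter shared by every text
    for text in texts:
        words = re.findall(r'\w+', text.lower())
        # update() adds the new counts onto the running totals
        total.update(w for w in words if w not in IGNORE)
    return total.most_common(n)

print(combined_top_words(["the cat sat", "a cat ran", "dog"], n=1))
```

In the script from the question, the same idea would mean creating the Counter before the os.walk loop and calling print(counter.most_common(10)) once, after the loop has finished.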

It looks like the parameter in the function definition is wrong. It should be:

def search(path):

The ignore list is correct, but the membership test can be made faster by using a set instead of a list:

ignore = set(['the','a','if','in','it','of','or','on','and','to'])

Otherwise, this is nice-looking code :-)
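To illustrate the set suggestion: both containers give the same filtered result, but "x not in ignore" scans a list element by element while a set uses a hash lookup. The sketch below (Python 3, with a made-up word list and a hypothetical helper filter_with) checks that the results match and times both. With only ten short stop words the difference is likely to be small, so treat the timing as a rough indication only.

```python
import timeit

ignore_list = ['the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to']
ignore_set = set(ignore_list)

# Made-up sample input, repeated to give the timer something to measure.
words = ['python', 'to', 'word', 'and', 'count', 'the'] * 1000

def filter_with(container):
    """Keep the words that are not in the given container (list or set)."""
    return [w for w in words if w not in container]

# Both containers must produce exactly the same filtered list.
same = filter_with(ignore_list) == filter_with(ignore_set)

list_time = timeit.timeit(lambda: filter_with(ignore_list), number=20)
set_time = timeit.timeit(lambda: filter_with(ignore_set), number=20)
print("same result:", same)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```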

2012-01-31 15:31:36

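When iterating over the results of os.walk, file contains only the file name, without the directory that contains it. You need to join the directory name with the file name: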

for root, dirs, files in os.walk(path):
    for name in files:
        file_path = os.path.join(root, name)
        # do processing on file_path here

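I'd suggest moving the code that processes a file into its own function. That way you won't have to write it twice, and it will be easier to debug problems.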
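A possible shape for that refactoring, as a sketch in Python 3 rather than a definitive version: count_words is a hypothetical per-file helper, search keeps the name from the question but returns a Counter instead of printing, and the "*.txt" filter via fnmatch is an assumption based on the question's wording (the original code has no such filter). UTF-8 input is also assumed.

```python
import collections
import fnmatch
import os
import re

IGNORE = frozenset(['the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'])

def count_words(file_path):
    """Count the words in one file, skipping IGNORE (hypothetical helper)."""
    with open(file_path, encoding="utf-8") as handle:
        words = re.findall(r"\w+", handle.read().lower())
    return collections.Counter(w for w in words if w not in IGNORE)

def search(path):
    """Count words in one file, or in every .txt file below a directory."""
    if not os.path.isdir(path):
        return count_words(path)
    total = collections.Counter()
    for root, dirs, files in os.walk(path):
        # fnmatch.filter keeps only the names that match the pattern
        for name in fnmatch.filter(files, "*.txt"):
            # join with root so files in subdirectories are found too
            total += count_words(os.path.join(root, name))
    return total
```

The caller can then decide what to print, for example print(search(path).most_common(10)).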

2012-01-31 15:31:47 interjay

This is because the 'files' list contains only file names, not full paths. You have to use:

import os.path

...

and pass 'open(os.path.join(root, file))' instead of 'open(file)'.

2012-01-31 15:31:52 huelbois

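I'd suggest taking a look at "Generator Tricks for Systems Programmers" by David M. Beazley. It shows how to build small generator pipelines that do everything you are doing here. Basically, use the gengrep example, but replace the grep with a word count: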

# gencount.py 
# 
# Count the words in a sequence of lines 

import re, collections 
def gen_count(lines):
    patc = re.compile('\w+')
    ignore = ['the','a','if','in','it','of','or','on','and','to']
    for line in lines:
        words = patc.findall(line)
        counter=collections.Counter(x for x in words if x not in ignore)
        for count in counter.most_common(10):
            yield count

# Example use

if __name__ == '__main__':
    from genfind import gen_find
    from genopen import gen_open
    from gencat import gen_cat
    path = raw_input("Enter file and path, place ' before and after the file path: ")

    findnames = gen_find("*.txt",path)
    openfiles = gen_open(findnames)
    alllines = gen_cat(openfiles)

    currcount = gen_count(alllines)
    for c in currcount:
        print c
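The example above depends on the genfind, genopen and gencat modules from Beazley's material. For readers who don't have them, here is a self-contained sketch of the same pipeline idea in Python 3. The gen_find, gen_open and gen_cat functions below are simplified stand-ins written for this illustration, not Beazley's originals, and gen_words and top_words are hypothetical names. Unlike gen_count above, which yields counts line by line, this version feeds every word into one Counter so the result covers all files together.

```python
import fnmatch
import os
import re
from collections import Counter

IGNORE = frozenset(['the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'])

def gen_find(pattern, top):
    """Yield paths below 'top' whose file name matches 'pattern' (stand-in)."""
    for root, dirs, files in os.walk(top):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)

def gen_open(paths):
    """Yield an open file object for each path, closing it afterwards (stand-in)."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            yield handle

def gen_cat(sources):
    """Chain the lines of several open files into one stream (stand-in)."""
    for source in sources:
        yield from source

def gen_words(lines):
    """Yield each lowercase word found in a stream of lines."""
    pattern = re.compile(r"\w+")
    for line in lines:
        for word in pattern.findall(line.lower()):
            yield word

def top_words(path, n=10):
    """Return the n most common words in all .txt files below path."""
    lines = gen_cat(gen_open(gen_find("*.txt", path)))
    counter = Counter(w for w in gen_words(lines) if w not in IGNORE)
    return counter.most_common(n)
```

Because every stage is a generator, only one line is held in memory at a time. The Counter itself is the only structure that grows with the input.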

2012-01-31 15:40:14

Change it to:

for file in files:
    fullPath = "%s/%s" % (root, file)  # use root, not path, so files in subdirectories resolve too

2012-01-31 15:40:38 Sid

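You should have two functions: one that goes through a file and counts the words, and another that goes through the files in a directory and calls itself recursively when it finds a directory. The per-file function should take the full path of the file and open the file itself.
Reading the entire file at once may make you run out of memory. A line-by-line approach is better. Even better would be to write a generator function that reads 4K at a time and yields individual words, but that may be overkill for this task.
Take a look at os.path.walk() (Python 2 only; it was removed in Python 3, where os.walk() does the same job).
If you are using Python 2, use raw_input. People will ignore the "quote the path" prompt.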
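The chunked generator mentioned above could look roughly like this. It is a sketch in Python 3, and iter_words is a made-up name. The one subtle part is that a word can be cut in half at a chunk boundary, so the trailing partial word is held back and glued onto the next chunk.

```python
import io
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")

def iter_words(stream, chunk_size=4096):
    """Yield lowercase words from a text stream, reading chunk_size characters at a time."""
    tail = ""  # partial word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = tail + chunk
        words = WORD_RE.findall(text)
        # If the text ends in a word character, the last word may continue
        # in the next chunk, so hold it back instead of yielding it now.
        if words and WORD_RE.match(text[-1]):
            tail = words.pop()
        else:
            tail = ""
        for word in words:
            yield word.lower()
    if tail:
        yield tail.lower()

# Tiny chunk size to force words to be split across chunks.
sample = io.StringIO("the quick brown fox jumps")
print(list(iter_words(sample, chunk_size=4)))
```

With a real file this would be used as Counter(iter_words(handle)) inside a with open(...) block. For ordinary text files, iterating line by line is simpler and usually sufficient.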

2012-01-31 15:42:32
