2015-02-07

Word/expression list frequency distribution - performance improvement

I have another Python question about building a frequency distribution from text matched against a predefined word list. I have over 100,000 text files (each containing roughly 15,000 words) that I want to read in and match against a word/expression list (vocabulary_dict) of about 6,000 entries. The result should be a dictionary of all those entries with their respective frequencies. Here is what I am currently doing:

sample_text = "As the prices of U.S. homes started to falter, doubts arose throughout the global financial system. Banks became weaker, private credit markets stopped functioning, and by the end of the year it was clear that the world banks had sunk into a global recession." 

vocabulary_dict = dict.fromkeys(['prices', 'banks', 'world banks', 'private credit markets', 'recession', 'global recession'], 0) 

import re
from os import listdir
from collections import Counter

import nltk


def list_textfiles(directory):
    # Create a list of all files stored in DIRECTORY ending in '.txt'
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles


for filename in list_textfiles(directory):
    # Read each report, match the tokenized text against the predefined
    # word list, and count the occurrences of each element of that list
    sample_text = read_textfile(filename).lower()
    splitted = nltk.word_tokenize(sample_text)
    c = Counter(splitted)
    vocabulary_dict = dict.fromkeys(vocabulary_dict, 0)  # reset counts for each report
    string = str(filename)  # write certain part of filename to outfile
    string_print = (string[string.rfind('/') + 1:string.find('-')] + ':' +
                    string[-6:-4] + '.' + string[-8:-6] + '.' + string[-12:-8])
    for k in sorted(vocabulary_dict):
        # Entries of more than one token are matched with a word-boundary
        # regex against the raw text; single tokens are looked up in the
        # token counter
        if len(k.split()) > 1:
            vocabulary_dict[k] += len(re.findall(r'\b{0}\b'.format(k), sample_text))
        else:
            vocabulary_dict[k] += c[k]
    with open(filename[:-4] + '_output.txt', mode='w') as outfile:
        outfile.write(string_print + '\n')
        # Write each dictionary entry line by line to the corresponding output
        # file: company name, fiscal year end, and the tab-separated frequency
        # distribution
        for key, value in sorted(vocabulary_dict.items()):
            outfile.write(str(key) + '\t' + str(value) + '\n')

# Output according to the above stated example should be in the form: 
"selected part of filename (=string1)" 
'prices' 1 
'banks' 2 
'world banks' 1 
'private credit markets' 1 
'recession' 1 
'global recession' 1 

The code works fine, but I still think there is room for optimization, since processing a single text file takes about a minute. My question: is there a way to make the matching of the text against the word/expression list faster? Thanks a lot for your help :)
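One likely bottleneck (an assumption on my part, not something stated in the post) is that the word-boundary regex for every multi-word entry is rebuilt and recompiled for every one of the 100,000 files. A minimal sketch of compiling the patterns once, outside the file loop, and reusing them per file:

```python
import re

vocabulary = ['prices', 'banks', 'world banks', 'private credit markets',
              'recession', 'global recession']

# Compile one word-boundary pattern per entry, once, before the file loop
patterns = {k: re.compile(r'\b{0}\b'.format(re.escape(k))) for k in vocabulary}

def count_vocabulary(text):
    # Count every entry in one lowercased text; phrases may overlap,
    # so 'banks' is also counted inside 'world banks'
    return {k: len(p.findall(text)) for k, p in patterns.items()}

sample_text = ("As the prices of U.S. homes started to falter, doubts arose "
               "throughout the global financial system. Banks became weaker, "
               "private credit markets stopped functioning, and by the end of "
               "the year it was clear that the world banks had sunk into a "
               "global recession.").lower()

freq = count_vocabulary(sample_text)
```

On the sample text this reproduces the expected output, including `'banks': 2` (once on its own, once inside "world banks").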

Answer


I don't know whether this is faster, but it is definitely shorter. Give it a spin?

text = "As the prices of U.S. homes started to falter, doubts arose throughout the global financial system. Banks became weaker, private credit markets stopped functioning, and by the end of the year it was clear that the world banks had sunk into a global recession." 

newDict = dict((k, text.count(k) + text.count(k.title())) for k in vocabulary_dict) 
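As a quick sanity check (not part of the original answer), here is the one-liner applied to the question's sample data, assuming the list entry is spelled 'private credit markets'. Note that `str.count` does plain substring matching with no word boundaries, and `k.title()` only catches occurrences capitalized the way `str.title` produces them:

```python
vocabulary_dict = ['prices', 'banks', 'world banks', 'private credit markets',
                   'recession', 'global recession']

text = ("As the prices of U.S. homes started to falter, doubts arose "
        "throughout the global financial system. Banks became weaker, "
        "private credit markets stopped functioning, and by the end of the "
        "year it was clear that the world banks had sunk into a global "
        "recession.")

# text.count(k) catches lowercase occurrences; text.count(k.title())
# additionally catches capitalized ones such as "Banks"
newDict = dict((k, text.count(k) + text.count(k.title())) for k in vocabulary_dict)
```

On this input it happens to reproduce the expected counts, but substring matching means an entry like 'banks' would also match inside an unrelated word such as "riverbanks".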

In any case, you should ask this question on CodeReview.