2013-04-11 37 views
0

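How can I make my code look only for specific words in a text file and display each word with its count (the number of times the word appears in the file)? In other words, how do I read a text file and count only certain words?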

from collections import Counter
import re

def openfile(filename):
    # Read the whole file into a single string.
    fh = open(filename, "r+")
    text = fh.read()
    fh.close()
    return text

def removegarbage(text):
    # Replace every run of non-word characters with one space, then lowercase.
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

def getwordbins(words):
    # Tally how often each word occurs.
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split(' ')
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print(key, value)

main('filename.txt', 10)
+2

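You need a list of the words you want to keep counts for, and you should only add to a count when the input word is on that list. As an optimization, you could initialize the `cnt` dictionary with a count of zero for each "interesting" word, and then in the main loop increment only when the word already has a count. – tripleee 2013-04-11 05:49:01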
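
A minimal sketch of the approach described in this comment, assuming Python 3. The function name `count_interesting` and the sample words are illustrative and do not come from the question:

```python
def count_interesting(words, interesting_words):
    # Start every interesting word at zero so it shows up in the result
    # even if it never occurs in the input.
    cnt = dict.fromkeys(interesting_words, 0)
    for word in words:
        # Only words that already have a count are incremented.
        if word in cnt:
            cnt[word] += 1
    return cnt


sample = "the cat and the hat and the bat".split()
counts = count_interesting(sample, ["the", "and", "dog"])
print(counts)
```

Because the dictionary is filled up front, a word that never occurs (here the made-up `dog`) still appears in the result with a count of zero.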

+0

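[Please use consistent indentation](http://www.python.org/dev/peps/pep-0008/). But in any case, I don't understand the question. What do you want it to do that your code doesn't? What does it do that you don't want? What does "read only specific words" mean? You can't know whether a word is in the list of "specific words" until you look at it, i.e. read it. – 2013-04-11 07:11:41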

Answers

0

This will probably be enough... it's not exactly what you asked for, but the end result is what you want (I think):

from collections import Counter

interesting_words = ["ipsum", "dolor"]

some_text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec viverra consectetur sapien, sed posuere sem rhoncus quis. Mauris sit amet ligula et nulla ultrices commodo sed sit amet odio. Nullam vel lobortis nunc. Donec semper sem ut est convallis posuere adipiscing eros lobortis. Nullam tempus rutrum nulla vitae pretium. Proin ut neque id nisi semper faucibus. Sed sodales magna faucibus lacus tristique ornare.
"""

# Count every whitespace-separated token, then keep only the interesting ones.
d = Counter(some_text.split())
final_list = list(filter(lambda item: item[0] in interesting_words, d.items()))

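But the complexity isn't great: it counts every word in the text and then scans the `interesting_words` list once for each distinct word, so it may take a while with a large file and/or a large `interesting_words` list.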
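
One way to reduce that cost, sketched here under the assumption of Python 3, is to keep the interesting words in a set and filter while counting. The shortened sample text is made up for illustration:

```python
from collections import Counter

# A set gives average constant-time membership tests, unlike a list.
interesting_words = {"ipsum", "dolor"}

some_text = "Lorem ipsum dolor sit amet, ipsum dolor ipsum"

# Filter while counting, so uninteresting words are never stored.
counts = Counter(w for w in some_text.split() if w in interesting_words)
print(counts.most_common())
```

This still visits every token once, but it never builds counts for words nobody asked about.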

1

I think using that many functions makes this too complicated. Why not do it in a single function?

# Wrap this in a function if desired;
# the file path, the specific words, etc. could be parameters.

specificword = "foo"  # placeholder: the word to look for
# Map each listed punctuation character to a space.
table = str.maketrans(".,!?", "    ")

f = open("filename.txt")
counter = 0
for line in f:
    # Replace punctuation with spaces so that every interesting word
    # is surrounded by whitespace and can be detected.
    line = line.translate(table)
    words = line.split()  # splits on any run of whitespace
    for word in words:
        if word == specificword:
            # or use a list of specific words:
            # if word in specificwordlist:
            counter += 1
            print(word)
            # you could also append the words to some list,
            # create a dictionary, etc.
f.close()
+1

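Re: using `line.split(" ")`: don't forget that words can be separated by far more than a single space. There can be two spaces, three spaces, a newline; the possibilities are endless... – Gibron 2013-04-11 06:16:42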

+0

...and punctuation. – 2013-04-11 06:17:34

+0

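Oh, right. I've added a translate call that turns punctuation into spaces. To ignore the number of spaces I now use line.split(), which splits on any amount of whitespace. – user1451340 2013-04-11 07:21:33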
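
The difference discussed in these comments can be seen in a small sketch, assuming Python 3 (where the translation table comes from `str.maketrans`). The sample line is made up:

```python
line = "foo  bar,\tbaz!\nqux"

# split(" ") cuts at every single space, so a run of spaces yields empty
# strings, and tabs or newlines stay glued to the words.
naive = line.split(" ")

# Map a few punctuation marks to spaces, then split on any whitespace run.
table = str.maketrans(".,!?", "    ")
clean = line.translate(table).split()

print(naive)
print(clean)
```

Only the four punctuation marks listed in the table are handled here; other characters, such as brackets or quotes, would need to be added to it.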

1

A generator that yields all the words in a file comes in handy:

from collections import Counter
import re

def words(filename):
    # Yield every word in the file, lowercased, one at a time.
    regex = re.compile(r'\w+')
    with open(filename) as f:
        for line in f:
            for word in regex.findall(line):
                yield word.lower()

Then, either count every word and look up the ones you want:

wordcount = Counter(words('filename.txt'))
for word in ['foo', 'bar']:
    print(word, wordcount[word])

or count only the words you want:

words_to_count = set(['foo', 'bar'])
wordcount = Counter(word for word in words('filename.txt')
                    if word in words_to_count)
print(list(wordcount.items()))
1

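I think what you're looking for is a simple dictionary structure. It lets you keep track not only of the words you're looking for, but also of their counts.

A dictionary stores things as key/value pairs. So, for example, you could use "alice" as a key (a word you want to find) and set its value to the number of times you found that word.

The simplest way to check whether something is in a dictionary is Python's `in` keyword, i.e.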

if 'pie' in words_in_my_dict: pass  # do something
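
To make the key/value idea concrete, here is a small sketch assuming Python 3. The dictionary contents are made up for illustration:

```python
words_in_my_dict = {"alice": 2, "pie": 0}

# `in` tests the keys of a dictionary, not its values.
has_pie = "pie" in words_in_my_dict
has_cake = "cake" in words_in_my_dict
has_two = 2 in words_in_my_dict  # 2 is a value here, not a key

if has_pie:
    words_in_my_dict["pie"] += 1  # update the value stored under the key

print(has_pie, has_cake, has_two, words_in_my_dict["pie"])
```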

With that information in hand, building a word counter is easy:

def get_word_counts(words_to_count, filename):
    # Note: `filename` actually holds the text to be counted, not a path.
    words = filename.split(' ')
    for word in words:
        if word in words_to_count:
            words_to_count[word] += 1
    return words_to_count

if __name__ == '__main__':

    fake_file_contents = (
        "Alice's Adventures in Wonderland (commonly shortened to "
        "Alice in Wonderland) is an 1865 novel written by English"
        " author Charles Lutwidge Dodgson under the pseudonym Lewis"
        " Carroll.[1] It tells of a girl named Alice who falls "
        "down a rabbit hole into a fantasy world populated by peculiar,"
        " anthropomorphic creatures. The tale plays with logic, giving "
        "the story lasting popularity with adults as well as children."
        "[2] It is considered to be one of the best examples of the literary "
        "nonsense genre,[2][3] and its narrative course and structure, "
        "characters and imagery have been enormously influential[3] in "
        "both popular culture and literature, especially in the fantasy genre."
    )

    words_to_count = {
        'alice': 0,
        'and': 0,
        'the': 0
    }

    print(get_word_counts(words_to_count, fake_file_contents))

This gives the output:

{'and': 4, 'the': 5, 'alice': 0} 

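The dictionary stores the words we want to count along with the number of times they appear. The whole algorithm just checks whether each word is in the dict and, if it is, adds 1 to that word's value.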
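
Note that `alice` stays at 0 in the output above: the comparison is case-sensitive, and the text only contains `Alice` and `Alice's`. A case-insensitive variant might look like the following sketch, assuming Python 3. The name `get_word_counts_ci`, the regular expression, and the sample text are illustrative choices, not part of the original answer:

```python
import re

def get_word_counts_ci(wanted, text):
    # Build a fresh dict so the caller's collection is not modified.
    counts = dict.fromkeys((w.lower() for w in wanted), 0)
    # [a-z']+ keeps apostrophes, so "alice's" stays distinct from "alice".
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in counts:
            counts[word] += 1
    return counts


text = "Alice in Wonderland. Alice's tale. The girl named Alice and the rabbit."
result = get_word_counts_ci(["alice", "and", "the"], text)
print(result)
```

Whether `Alice's` should count as `alice` is a design decision; this sketch treats them as different words.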

Read up on dictionaries here.

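Edit:

If you want to count all the words and then pick out a specific set, a dictionary is still great (and fast!) for the task.

The only change we need to make is to first check whether the key exists in the dictionary and, if it doesn't, add it.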

def get_all_word_counts(filename):
    # Note: `filename` actually holds the text to be counted, not a path.
    words = filename.split(' ')

    word_counts = {}
    for word in words:
        if word not in word_counts:  # If not already there,
            word_counts[word] = 0    # add it in.
        word_counts[word] += 1       # Increment the count accordingly
    return word_counts

This gives the output:

and : 4 
shortened : 1 
named : 1 
popularity : 1 
peculiar, : 1 
be : 1 
populated : 1 
is : 2 
(commonly : 1 
nonsense : 1 
an : 1 
down : 1 
fantasy : 2 
as : 2 
examples : 1 
have : 1 
in : 4 
girl : 1 
tells : 1 
best : 1 
adults : 1 
one : 1 
literary : 1 
story : 1 
plays : 1 
falls : 1 
author : 1 
giving : 1 
enormously : 1 
been : 1 
its : 1 
The : 1 
to : 2 
written : 1 
under : 1 
genre,[2][3] : 1 
literature, : 1 
into : 1 
pseudonym : 1 
children.[2] : 1 
imagery : 1 
who : 1 
influential[3] : 1 
characters : 1 
Alice's : 1 
Dodgson : 1 
Adventures : 1 
Alice : 2 
popular : 1 
structure, : 1 
1865 : 1 
rabbit : 1 
English : 1 
Lutwidge : 1 
hole : 1 
Carroll.[1] : 1 
with : 2 
by : 2 
especially : 1 
a : 3 
both : 1 
novel : 1 
anthropomorphic : 1 
creatures. : 1 
world : 1 
course : 1 
considered : 1 
Lewis : 1 
Charles : 1 
well : 1 
It : 2 
tale : 1 
narrative : 1 
Wonderland) : 1 
culture : 1 
of : 3 
Wonderland : 1 
the : 5 
genre. : 1 
logic, : 1 
lasting : 1 

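Note: as you can see, there are a few "misfires" when we split(' ') the file. Specifically, some words have leading or trailing brackets or punctuation attached. You will have to account for this in your file processing... but I'll leave that up to you!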
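
One possible way to handle those misfires, sketched here assuming Python 3: remove bracketed citation markers first, then trim punctuation from both ends of each token. The function name `count_stripped` and the sample string are made up for illustration:

```python
import re
import string
from collections import Counter

def count_stripped(text):
    # Drop bracketed citation markers such as [1] or [2][3].
    text = re.sub(r"\[\d+\]", " ", text)
    # Trim punctuation from both ends of each token; inner characters
    # such as the apostrophe in "Alice's" are left alone.
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return Counter(tok for tok in tokens if tok)


sample = ("(commonly shortened to Alice in Wonderland) genre,[2][3] "
          "Carroll.[1] Alice's Alice")
counts = count_stripped(sample)
print(sorted(counts.items()))
```

This keeps the original capitalization, so `Alice` and `alice` would still be counted separately unless the text is lowercased first.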

+0

+1, although I would use the dictionary to count every word and then look up the wanted words. – pwagner 2013-04-11 07:21:23
