Counting (and writing) word frequencies for each line in a text file

First time posting here; I have always been able to find previous questions that solved my problem! My main problem is the logic... even a pseudocode answer would be great.

I'm using Python to read data from each line of a text file, in the format:

This is a tweet captured from the twitter api #hashtag http://url.com/site

Using NLTK, I can tokenize by line and then iterate over the lines with reader.sents(), etc.:

reader = TaggedCorpusReader(filecorpus, r'.*\.txt', sent_tokenizer=LineTokenizer())

reader.sents()[:10]

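But I'd like to count the frequency of certain "hot words" (stored in an array or similar) per line, and then write the counts back to a text file. If I use reader.words(), I can count the frequency of the "hot words" across the whole text, but I'm looking for the count per line (or per "sentence", in this case).

Ideally, something like: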

hotwords = (['tweet'], ['twitter'])

for each line
    tokenize into words
    for each word in line
        if word is equal to hotword[1], hotword1 count ++
        if word is equal to hotword[2], hotword2 count ++
    at end of line, for each hotword[index]
        filewrite count

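Also, I'm not too worried about URLs getting broken up (using WordPunctTokenizer would split them on punctuation; that's not a problem).

Any useful pointers (including pseudocode or links to other similar code) would be great.

----EDIT------------------

Ended up doing something like this: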

import nltk 
from nltk.corpus.reader import TaggedCorpusReader 
from nltk.tokenize import LineTokenizer 
#from nltk.tokenize import WordPunctTokenizer 
from collections import defaultdict 

# Create reader and generate corpus from all .csv files in dir.
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus'
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer())
print("Reader accessible.")
print(filereader.fileids())

# Define hotwords
hotwords = ('cool', 'foo', 'bar')

# One defaultdict of hotword counts per line (tweet)
tweetdict = []

for line in filereader.sents():
    wordcounts = defaultdict(int)
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)

The output is:

print(tweetdict)

[defaultdict(<type 'int'>, {}),
defaultdict(<type 'int'>, {'foo': 2, 'bar': 1, 'cool': 2}),
defaultdict(<type 'int'>, {'cool': 1})]
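The original goal also included writing the per-line counts back to a text file. A minimal sketch of that step, assuming a list of per-line count dicts shaped like tweetdict above. The helper name write_counts is hypothetical, the sample data is a stand-in for the real corpus, and the code is written for Python 3:

```python
import csv
import io
from collections import defaultdict

hotwords = ('cool', 'foo', 'bar')

def write_counts(per_line_counts, out, hotwords):
    """Write one CSV row per line, one column per hotword (hypothetical helper)."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(hotwords)  # header row
    for counts in per_line_counts:
        # .get() returns 0 for missing words without inserting keys
        writer.writerow([counts.get(word, 0) for word in hotwords])

# Stand-in for the tweetdict list built above
tweetdict = [defaultdict(int),
             defaultdict(int, {'foo': 2, 'bar': 1, 'cool': 2}),
             defaultdict(int, {'cool': 1})]

# An in-memory buffer keeps the sketch self-contained
buffer = io.StringIO()
write_counts(tweetdict, buffer, hotwords)
print(buffer.getvalue())
```

Passing a real file opened with open(path, 'w', newline='') in place of the buffer should work the same way.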

Source

2011-04-08 bhalsall

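defaultdict is your friend for this kind of thing.

from collections import defaultdict

# Assumes myfile is an open file object and hotwords is a collection of words
for line in myfile:
    words = line.split()  # tokenize; substitute any tokenizer here
    word_counts = defaultdict(int)  # fresh counts for each line
    for word in words:
        if word in hotwords:
            word_counts[word] += 1
    print('\n'.join('%s: %s' % (k, v) for k, v in word_counts.items()))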

Source

2011-04-08 13:36:48

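Yes, I only tweaked this slightly, but the logic is great; I prefer this to the Counter solution. Is creating a defaultdict for every line in the text file the most efficient approach? – bhalsall 2011-04-08 15:35:58

@bhalsall: You could call 'word_counts.clear()' after each line instead of creating a new defaultdict each time. – jfs 2011-04-09 10:13:07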
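One caveat with reusing a single dict: clear() only helps if each line's counts are written out (or copied) before the next line is processed. If the same object is appended to a list, as in the question's edit, every entry ends up referring to the one dict. A minimal sketch, assuming whitespace tokenization and made-up sample lines:

```python
from collections import defaultdict

hotwords = ('tweet', 'twitter')
lines = ["a tweet about twitter", "no hot words here", "tweet tweet"]  # sample data

word_counts = defaultdict(int)  # created once, reused for every line
results = []
for line in lines:
    word_counts.clear()  # reset counts left over from the previous line
    for word in line.split():
        if word in hotwords:
            word_counts[word] += 1
    # Store a copy: appending word_counts itself would store the same
    # object each time, and the next clear() would empty it.
    results.append(dict(word_counts))

print(results)
```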

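Do you need to tokenize it? You can use count() for each word on each line.

hotwords = {'tweet': [], 'twitter': []}
for line in file_obj:  # file_obj: an open file object
    for word in hotwords:
        # Append this line's count for the word
        hotwords[word].append(line.count(word))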

Source

2011-04-08 13:25:29 nmichaels

That way you end up counting substrings. If the hot word == 'sex', I don't want Middlesex to be counted. – 2011-04-08 13:27:50

@Steve: Ah, right. – nmichaels 2011-04-08 13:30:20
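To illustrate the substring problem, here is a small sketch comparing str.count() with a whole-word match. The helper count_word is hypothetical, and the word-boundary regex is one possible approach rather than anything proposed in the thread:

```python
import re

def count_word(line, word):
    """Count whole-word occurrences of word in line (hypothetical helper)."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, line))

line = "Middlesex is a county; sex is the hot word"

substring_hits = line.count('sex')        # also matches inside "Middlesex"
whole_word_hits = count_word(line, 'sex') # matches the standalone word only

print(substring_hits, whole_word_hits)
```

Note that \b treats '#' as a non-word character, so with this pattern '#tweet' would still count as a match for 'tweet'.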

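That's the right idea, though. Ideally, I need to re-tokenize each line into words. I can't just tokenize into words from the start, because then I wouldn't keep the newline delimiters (which separate the tweets)... I'd end up computing word frequencies for the whole text file rather than per line. – bhalsall 2011-04-08 13:33:45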

from collections import Counter

hotwords = ('tweet', 'twitter')

lines = "a b c tweet d e f\ng h i j k twitter\n\na"

# Count every whitespace-separated word across the whole text
c = Counter(lines.split())

for hotword in hotwords:
    print('%s %s' % (hotword, c[hotword]))

This script works with Python 2.7+.
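The Counter example above counts across the whole text. Since the question asks for counts per line, a per-line variant might look like the following sketch, which assumes whitespace tokenization and reuses the same sample string:

```python
from collections import Counter

hotwords = ('tweet', 'twitter')
lines = "a b c tweet d e f\ng h i j k twitter\n\na"

per_line = []
for line in lines.splitlines():
    c = Counter(line.split())  # fresh Counter for each line
    # A Counter returns 0 for words it has not seen
    per_line.append([c[hotword] for hotword in hotwords])

for counts in per_line:
    print(counts)
```

Each inner list holds the counts for one line, in the same order as hotwords; empty lines simply give all zeros.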

Source

2011-04-08 13:37:23 razpeitia

You can also use 'most_common', e.g. 'c.most_common(10)', to get the 10 most common words in the counter. – razpeitia 2011-04-08 13:48:19
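As a quick illustration of most_common, here is a sketch using the same sample string as the answer; asking for only the top entry is an arbitrary choice to keep the example small:

```python
from collections import Counter

lines = "a b c tweet d e f\ng h i j k twitter\n\na"
c = Counter(lines.split())

# most_common(n) returns a list of (word, count) pairs, highest count first
top = c.most_common(1)
print(top)
```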

I was going to suggest using a dictionary {String word: int count} like @Daniel Roseman, but this looks slicker. – Tom 2011-04-08 14:52:14
