
I have some code that gives me a list of words and how often they occur in a text, and I'm looking to make the code automatically turn the top 10 words into an ARFF with

@RELATION wordfrequencies

@ATTRIBUTE word string
@ATTRIBUTE frequency numeric

and the top 10 words, with their frequencies, as the data.
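For illustration, the data section at the end of the finished file would then look something like this (the words and counts here are only placeholders, not real output):

@DATA
'someword',57
'otherword',42
...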

I'm struggling with how to do this with my current code:

import re 
import nltk 

# Quran subset 
filename = 'subsetQuran.txt' 

# create list of lower case words 
word_list = re.split('\s+', file(filename).read().lower()) 
print 'Words in text:', len(word_list) 

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')] 



# create dictionary of word:frequency pairs 
freq_dic = {} 
# punctuation and numbers to be removed 
punctuation = re.compile(r'[-.?!,":;()|0-9]') 
for word in word_list2: 
    # remove punctuation marks 
    word = punctuation.sub("", word) 
    # form dictionary 
    try: 
        freq_dic[word] += 1 
    except: 
        freq_dic[word] = 1 


print '-'*30 

print "sorted by highest frequency first:" 
# create list of (val, key) tuple pairs 
freq_list2 = [(val, key) for key, val in freq_dic.items()] 
# sort by val or frequency 
freq_list2.sort(reverse=True) 
freq_list3 = list(freq_list2) 
# display result 
for freq, word in freq_list2: 
    print word, freq 
f = open("wordfreq.txt", "w") 
f.write(str(freq_list3)) 
f.close() 

Any help is appreciated; I've been struggling with this and it's really hurting my brain!


Not sure if this will help, but it shows you how to make an ARFF of all the words, and you could then edit it to take only the top 10? http://stackoverflow.com/questions/5230699/creating-an-arff-file-from-python-output – jenniem001 2011-04-01 10:06:44

Answer


I hope you don't mind a slight rewrite:

import re 
import nltk 
from collections import defaultdict 

# Quran subset 
filename = 'subsetQuran.txt' 

# create list of lower case words 
word_list = open(filename).read().lower().split() 
print 'Words in text:', len(word_list) 

# remove stopwords 
word_list = [w for w in word_list if w not in nltk.corpus.stopwords.words('english')] 

# create dictionary of word:frequency pairs 
freq_dic = defaultdict(int) 

# punctuation and numbers to be removed 
punctuation = re.compile(r'[-.?!,":;()|0-9]') 
for word in word_list: 
    # remove punctuation marks 
    word = punctuation.sub("", word) 
    # increment count for word 
    freq_dic[word] += 1 

print '-' * 30 

print "sorted by highest frequency first:" 
# create list of (frequency, word) tuple pairs 
freq_list = [(freq, word) for word, freq in freq_dic.items()] 

# sort by descending frequency 
freq_list.sort(reverse=True) 

# display result 
for freq, word in freq_list: 
    print word, freq 

# write ARFF file for 10 most common words 
f = open("wordfreq.txt", "w") 
f.write("@RELATION wordfrequencies\n") 
f.write("@ATTRIBUTE word string\n") 
f.write("@ATTRIBUTE frequency numeric\n") 
f.write("@DATA\n") 
for freq, word in freq_list[ : 10]: 
    f.write("'%s',%d\n" % (word, freq)) 
f.close()
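
If you're on Python 2.7 or newer, collections.Counter is another option: its most_common() method hands you the top 10 directly, so you can skip building and sorting the tuple list. A minimal sketch, assuming the same word_list and punctuation pattern as above:

from collections import Counter

# count words after stripping punctuation (word_list and punctuation as defined above)
freq_dic = Counter(punctuation.sub("", word) for word in word_list)

# write ARFF file for the 10 most common words
f = open("wordfreq.txt", "w")
f.write("@RELATION wordfrequencies\n")
f.write("@ATTRIBUTE word string\n")
f.write("@ATTRIBUTE frequency numeric\n")
f.write("@DATA\n")
# most_common(10) returns (word, count) pairs in descending order of count
for word, freq in freq_dic.most_common(10):
    f.write("'%s',%d\n" % (word, freq))
f.close()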