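Creating an ARFF from word frequencies

I have some code that gives me a list of words along with how often they occur in a text. I would like the code to automatically convert the top 10 words into an ARFF with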
@RELATION wordfrequencies
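@ATTRIBUTE word string
@ATTRIBUTE frequency numeric
and the top 10 words, with their frequencies, as the data.
I'm struggling with how to do this with my current code: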
import re
import nltk

# Quran subset
filename = 'subsetQuran.txt'

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]

# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list2:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1

print '-'*30

print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result
for freq, word in freq_list2:
    print word, freq

f = open("wordfreq.txt", "w")
f.write(str(freq_list3))
f.close()
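Any help is appreciated. I've been struggling with this, and it's really hurting my brain!
Not sure if this will help, but it shows you how to make an ARFF of all the words, which you could then edit to take only the top 10? http://stackoverflow.com/questions/5230699/creating-an-arff-file-from-python-output – jenniem001 2011-04-01 10:06:44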
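
One possible way to get from the (frequency, word) pairs in freq_list2 to the ARFF described in the question is sketched below. It is written for Python 3, whereas the code in the question is Python 2. The function names build_arff and write_arff, the single-quote escaping of words, and the choice to skip empty strings left over after punctuation stripping are all assumptions made for illustration, not something taken from the question or from a particular library.

```python
def build_arff(freq_list, n=10, relation="wordfrequencies"):
    """Return ARFF text for the n most frequent (frequency, word) pairs.

    build_arff is a hypothetical helper; freq_list is assumed to have the
    same shape as freq_list2 in the question: a list of (count, word) tuples.
    """
    # Assumption: drop empty "words" that punctuation stripping can leave behind.
    pairs = [(freq, word) for freq, word in freq_list if word]
    # Sort by frequency, highest first, and keep only the first n pairs.
    top = sorted(pairs, reverse=True)[:n]

    lines = [
        "@RELATION " + relation,
        "",
        "@ATTRIBUTE word string",
        "@ATTRIBUTE frequency numeric",
        "",
        "@DATA",
    ]
    for freq, word in top:
        # Assumption: quote every word and backslash-escape quotes, so that
        # words containing apostrophes or commas stay in a single field.
        escaped = word.replace("\\", "\\\\").replace("'", "\\'")
        lines.append("'%s',%d" % (escaped, freq))
    return "\n".join(lines) + "\n"


def write_arff(freq_list, path, n=10):
    """Write the top-n ARFF text to path (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_arff(freq_list, n=n))


if __name__ == "__main__":
    # Made-up sample data standing in for freq_list2.
    sample = [(3, "day"), (7, "lord"), (1, "it's"), (5, "people")]
    print(build_arff(sample, n=3))
```

With the question's variables, the call would be something like write_arff(freq_list2, "wordfreq.arff"). Building the text in one function and writing it in another keeps the formatting easy to check without touching the disk.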