I have the following code excerpt that uses NLTK to find the number of syllables in every word of a given input text, "sample.txt":
import re
import nltk
from curses.ascii import isdigit
from nltk.corpus import cmudict
import nltk.data
import pprint

d = cmudict.dict()  # maps each word to a list of pronunciations (lists of phonemes)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

fp = open("sample.txt")
data = fp.read()
tokens = nltk.wordpunct_tokenize(data)
text = nltk.Text(tokens)
words = [w.lower() for w in text]
print(words)  # to print all the words in input text

regexp = "[A-Za-z]+"
exp = re.compile(regexp)

def nsyl(word):
    # vowel phonemes end in a stress digit, so counting them counts syllables;
    # d[word] raises KeyError when the word is not in the dictionary
    return max([len([y for y in x if isdigit(y[-1])]) for x in d[word]])

sum1 = 0
count = 0
count1 = 0
for a in words:
    if exp.match(a):
        print(a)
        print("no of syllables:", nsyl(a))
        sum1 = sum1 + nsyl(a)
        print("sum of syllables:", sum1)
        if nsyl(a) < 3:
            count = count + 1
        else:
            count1 = count1 + 1
print("no of words with syll count less than 3:", count)
print("no of complex words:", count1)
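This code matches each input word against the CMU dictionary and gives the number of syllables in that word. However, it fails with an error whenever a word is not found in the dictionary, for example when the input contains a proper noun. I want to check whether the word exists in the dictionary and, if it does not, skip it and continue with the next word. How can I do this?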
http://groups.google.com/group/nltk-users/browse_thread/thread/9823a1feeed5f3f2/81e70cb6704dc01e shows using the lower() version. That should take care of proper nouns. – I82Much 2011-05-03 22:17:55
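One way to skip unknown words is to look the word up with `dict.get` (or test `word in d`) before counting, instead of indexing `d[word]` directly. The following is only a minimal sketch, not a tested drop-in for the code above: `SAMPLE_DICT` is a tiny hand-written stand-in that assumes the same shape as `cmudict.dict()` (word mapped to a list of phoneme lists), and `summarize` and `WORD_RE` are hypothetical names introduced for illustration.

```python
import re

# Stand-in for cmudict.dict(), so the sketch runs without NLTK data.
# Assumed shape: word -> list of pronunciations, each a list of phonemes
# whose vowels end in a stress digit. With NLTK available, something like
# `from nltk.corpus import cmudict; d = cmudict.dict()` would replace it.
SAMPLE_DICT = {
    "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
    "banana": [["B", "AH0", "N", "AE1", "N", "AH0"]],
    "cat": [["K", "AE1", "T"]],
}

WORD_RE = re.compile(r"[A-Za-z]+")


def nsyl(word, pron_dict):
    """Return the largest syllable count over all pronunciations,
    or None when the word is not in the dictionary."""
    prons = pron_dict.get(word.lower())
    if not prons:
        return None
    return max(sum(1 for ph in pron if ph[-1].isdigit()) for pron in prons)


def summarize(words, pron_dict):
    """Count syllables, skipping non-alphabetic tokens and unknown words."""
    total = simple = complex_words = 0
    skipped = []
    for w in words:
        if not WORD_RE.fullmatch(w):
            continue  # punctuation, numbers, and so on
        n = nsyl(w, pron_dict)
        if n is None:
            skipped.append(w)  # not in the dictionary: skip and move on
            continue
        total += n
        if n < 3:
            simple += 1
        else:
            complex_words += 1
    return {
        "total_syllables": total,
        "simple_words": simple,
        "complex_words": complex_words,
        "skipped": skipped,
    }


result = summarize(["Hello", "banana", ",", "cat", "Zyxwv"], SAMPLE_DICT)
print(result)
```

Calling `nsyl` once per word and reusing the result also avoids the repeated lookups in the original loop. Collecting the skipped words in a list is a design choice made here so that missing entries can be inspected later; simply using `continue` would work as well.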