2011-05-03 40 views
3

I have the following code excerpt, which uses NLTK to find the number of syllables in every word of a given input text, "sample.txt":

    import re
    import nltk
    from curses.ascii import isdigit
    from nltk.corpus import cmudict
    import nltk.data
    import pprint

    d = cmudict.dict()  # maps each lowercase word to its list of pronunciations

    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    fp = open("sample.txt")
    data = fp.read()
    tokens = nltk.wordpunct_tokenize(data)
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    print(words)  # print all the words in the input text
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        # Vowel phonemes end in a stress digit; count them in each pronunciation.
        # Raises KeyError when the word is not in the dictionary.
        return max([len([y for y in x if isdigit(y[-1])]) for x in d[word]])

    sum1 = 0
    count = 0
    count1 = 0
    for a in words:
        if exp.match(a):
            print(a)
            print("no of syllables:", nsyl(a))
            sum1 = sum1 + nsyl(a)
            print("sum of syllables:", sum1)
            if nsyl(a) < 3:
                count = count + 1
            else:
                count1 = count1 + 1

    print("no of words with syll count less than 3:", count)
    print("no of complex words:", count1)

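This code matches each input word against the CMU dictionary and gives the number of syllables in that word. But it fails with an error whenever a word is not found in the dictionary, for example when the input contains proper nouns. I want to check whether the word exists in the dictionary and, if it does not, skip it and continue with the next word. How can I do that?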

Answers

2

I'm guessing the problem is a KeyError. Replace your definition with

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            # Word is missing from the dictionary; signal it with -1.
            return -1
        else:
            return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

Alternatively, you can check whether the word is in the dictionary before calling nsyl, and then you don't have to worry about it inside nsyl itself.
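A minimal sketch of that second approach, checking membership before the lookup and skipping unknown words. To keep it runnable without downloading the corpus, it uses a small hand-written stand-in for cmudict.dict(); the names SAMPLE_PRONUNCIATIONS and count_syllables are made up for this illustration and are not part of NLTK.

```python
# Hand-written stand-in for cmudict.dict() (an assumption for illustration):
# each word maps to a list of pronunciations, and each pronunciation is a
# list of phonemes whose vowels end in a stress digit.
SAMPLE_PRONUNCIATIONS = {
    "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
    "syllable": [["S", "IH1", "L", "AH0", "B", "AH0", "L"]],
    "the": [["DH", "AH0"], ["DH", "AH1"], ["DH", "IY0"]],
}


def nsyl(word, pron_dict):
    """Return the largest syllable count over all pronunciations of word."""
    return max(
        sum(1 for phoneme in pron if phoneme[-1].isdigit())
        for pron in pron_dict[word]
    )


def count_syllables(words, pron_dict):
    """Return (counts, skipped): syllable counts for known words, and the
    words that were skipped because they are missing from pron_dict."""
    counts = {}
    skipped = []
    for word in words:
        key = word.lower()  # dictionary keys are lowercase
        if key not in pron_dict:
            skipped.append(word)  # unknown word: skip it and move on
            continue
        counts[key] = nsyl(key, pron_dict)
    return counts, skipped


counts, skipped = count_syllables(
    ["Hello", "syllable", "Zaphod", "the"], SAMPLE_PRONUNCIATIONS
)
print(counts)
print(skipped)
```

With the real corpus you would presumably pass cmudict.dict() in place of the stand-in. Collecting the skipped words in a list, rather than returning -1, keeps missing entries out of the syllable totals.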

+0

http://groups.google.com/group/nltk-users/browse_thread/thread/9823a1feeed5f3f2/81e70cb6704dc01e shows the lower() version being used. That should take care of proper nouns. – I82Much 2011-05-03 22:17:55