NLTK custom tokenizer and tagger

Here is my requirement. I want to tokenize and tag a paragraph in such a way that I can achieve the following.

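It should identify dates and times in the paragraph and tag them as DATE and TIME.
It should identify known phrases in the paragraph and tag them as CUSTOM.
The rest of the content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.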

For example, the following sentence

"They all like to go there on 5th November 2010, but I am not interested."

should be tokenized and tagged as follows, where the custom phrase is "I am not interested".

[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), 
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), 
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]

Any suggestions would be helpful.

2010-10-14 Software Enthusiastic

How did you solve this problem? I have a similar use case where I need to tag known phrases in different sentences with custom tags. – AgentX 2017-07-17 09:38:20

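The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that is too time-consuming, the easy way is to run the POS tagger and post-process its output with regular expressions. Getting the longest match is the hard part here: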

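import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# '...' stands for the remaining month names, elided in this answer.
# The month and year are optional so that a partial date such as "5th"
# or "5th November" still matches while tokens are being accumulated.
DATE = re.compile(r'^[1-9][0-9]?(th|st|nd|rd)?( (January|...)( [12][0-9][0-9][0-9])?)?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []          # tokens of the candidate date seen so far
    date_found = False

    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:             # end of date found
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []                            # reprocess this token on its own
            date_found = False
        elif date_found and i == len(tagged) - 1:  # date ends the sentence
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w, t)
                phrase = []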

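TODO: expand the DATE regex, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that contain only an ordinal number.)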
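As a rough illustration of the TODO items, here is a minimal sketch that post-processes an already tagged sentence and merges DATE and CUSTOM spans, trying the longest span first. It is not the code from the answer above: the names MONTHS, CUSTOM_PHRASES, MAX_SPAN and merge_spans are hypothetical, the month list is spelled out as an assumption, and the input tags are written by hand rather than produced by pos_tag. Because the regex here requires a month, a bare "5th" is left with its ordinary tag.

```python
import re

# Assumption: English month names spelled out in full.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed pattern: day with optional ordinal suffix, month, optional year.
DATE = re.compile(r"^[1-9][0-9]?(st|nd|rd|th)? (%s)( [12][0-9]{3})?$" % MONTHS)

# Hypothetical list of known phrases, stored as tuples of tokens.
CUSTOM_PHRASES = [("I", "am", "not", "interested")]

MAX_SPAN = 4  # longest span, in tokens, tried at each position


def merge_spans(tagged):
    """Merge DATE and CUSTOM spans in a list of (word, tag) pairs."""
    words = [w for w, _ in tagged]
    out = []
    i = 0
    while i < len(tagged):
        # Try the longest span first so that "5th November 2010"
        # is preferred over "5th November".
        for n in range(min(MAX_SPAN, len(words) - i), 0, -1):
            span = tuple(words[i:i + n])
            if span in CUSTOM_PHRASES:
                label = "CUSTOM"
            elif DATE.match(" ".join(span)):
                label = "DATE"
            else:
                continue
            out.append((" ".join(span), label))
            i += n
            break
        else:
            # No span starting at this token matched: keep its tag.
            out.append(tagged[i])
            i += 1
    return out


# Hand-written tags for illustration; real pos_tag output may differ.
tagged = [("They", "PRP"), ("all", "VBP"), ("like", "IN"), ("to", "TO"),
          ("go", "VB"), ("there", "RB"), ("on", "IN"), ("5th", "JJ"),
          ("November", "NNP"), ("2010", "CD"), (",", ","), ("but", "CC"),
          ("I", "PRP"), ("am", "VBP"), ("not", "RB"), ("interested", "JJ"),
          (".", ".")]
result = merge_spans(tagged)
print(result)
```

In practice the tagged list would come from pos_tag(word_tokenize(sentence)), and MAX_SPAN would need to be at least as long as the longest known phrase.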

2010-10-14 13:33:53

Thanks for sharing the code. Let me try this out and I will get back to you as soon as possible... – 2010-10-16 05:28:36

You should use nltk.RegexpParser to achieve your goal.

Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
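A minimal sketch of what this suggestion could look like, assuming a chunk grammar written for this example (it is not taken from the linked chapter). RegexpParser matches on POS tags rather than on words, so the grammar below treats any number followed by a proper noun and an optional number as a DATE; it cannot tell a month from another proper noun. The input is tagged by hand, and the flatten helper is a hypothetical name.

```python
from nltk import RegexpParser, Tree

# Hand-tagged input so the sketch does not depend on a tagger model;
# real pos_tag output may assign different tags to these tokens.
tagged = [("on", "IN"), ("5th", "CD"), ("November", "NNP"),
          ("2010", "CD"), (",", ",")]

# Assumed grammar: number, proper noun, optional number.
parser = RegexpParser("DATE: {<CD><NNP><CD>?}")
tree = parser.parse(tagged)


def flatten(tree):
    """Collapse each chunk subtree into a single (phrase, label) pair."""
    out = []
    for node in tree:
        if isinstance(node, Tree):
            phrase = " ".join(word for word, _ in node.leaves())
            out.append((phrase, node.label()))
        else:
            out.append(node)
    return out


chunks = flatten(tree)
print(chunks)
```

Known CUSTOM phrases are defined by their words rather than their tags, so they would still need a word-level pass before or after chunking.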

2010-10-14 20:39:11 Neodawn

Let me go through it and I will get back to you... – 2010-10-18 06:53:37
