Word tokenization at boundaries

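I have some tweets that I would like to split into words. Most of it works fine, except when people run words together, as in trumpisamoron or makeamericagreatagain. But then there are also things like password, which should not be split into pass and word.

I know the nltk package has a punkt tokenizer module that splits text into sentences in a smart way. Is there something similar for words? Even if it is not in the nltk package?

Note: the password -> pass + word case is much less of a problem than the run-together-words case.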

2016-09-30 Sachin_ruk

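If you are working with '#hashtags', they should be treated differently (that is just my personal opinion, though). – alvas

I don't think this will help much, but you could get a text file of all English words and compare the tweets against it. It would not be 100% accurate, though. – Corgs

This would be an absolute brute-force solution (and could take a lot of computing power), but you could take the phrase 'trumpisamoron', run through all possible word splits of that string, and compare how likely each candidate is against a dictionary of 'word: frequency' key-value pairs. That basically means testing which of 't', 'tr', 'tru', 'trum' or 'trump' is most likely to be a word. I wouldn't recommend this solution, but depending on the size of your data it may be feasible. – blacksite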
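A minimal sketch of the prefix test described in the last comment. The `FREQ` dictionary and its counts are made up for illustration, and `prefix_scores` is a hypothetical helper, not part of any library:

```python
# Toy 'word: frequency' dictionary; the counts are invented for illustration.
FREQ = {"trump": 50, "is": 900, "a": 1200, "moron": 5, "tr": 1, "i": 800, "am": 300}

def prefix_scores(text, freq, max_len=20):
    """Return (prefix, relative frequency) for every prefix of text found in freq."""
    total = sum(freq.values())
    return [(text[:i], freq[text[:i]] / total)
            for i in range(1, min(len(text), max_len) + 1)
            if text[:i] in freq]

scores = prefix_scores("trumpisamoron", FREQ)
best = max(scores, key=lambda pair: pair[1])  # most frequent known prefix
print(scores)
print(best)
```

This only picks the first word. Repeating it on the remainder of the string, and comparing whole segmentations rather than single prefixes, is what the answer below does.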

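Ref: my answer to another question, Need to split #tags to text.

Changes I made in this answer: (1) a different corpus to get WORDS, and (2) added def memo(f) to speed up the process. You may need to add/use a corpus depending on the domain you are working in.

Check out the Word Segmentation Task from Norvig's work.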

from collections import Counter
import nltk

# Requires the corpora: nltk.download('reuters') and nltk.download('words')
WORDS = list(nltk.corpus.reuters.words()) + list(nltk.corpus.words.words())
COUNTS = Counter(WORDS)

def memo(f):
    "Memoize function f, whose args must all be hashable."
    cache = {}
    def fmemo(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    fmemo.cache = cache
    return fmemo

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x] / N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L) + 1)]

@memo
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    else:
        # Try every possible first word, then segment the rest recursively.
        candidates = ([first] + segment(rest)
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('password'))  # ['password']
print(segment('makeamericagreatagain'))  # ['make', 'america', 'great', 'again']
print(segment('trumpisamoron'))  # ['trump', 'is', 'a', 'moron']
print(segment('narcisticidiots'))  # ['narcistic', 'idiot', 's']
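On Python 3, the standard library's functools.lru_cache can play the same role as the hand-written memo helper. The sketch below assumes a small made-up COUNTS table instead of the NLTK corpora, and returns tuples rather than lists so the cached values cannot be modified by accident:

```python
from functools import lru_cache

# Toy counts, invented for illustration (not taken from any corpus).
COUNTS = {"make": 40, "america": 10, "great": 30, "again": 20, "a": 100, "me": 15}
N = sum(COUNTS.values())

def pwords(words):
    """Probability of a word sequence under the independence assumption."""
    result = 1.0
    for w in words:
        result *= COUNTS.get(w, 0) / N
    return result

@lru_cache(maxsize=None)
def segment(text):
    """Most probable segmentation of text, as a tuple of words."""
    if not text:
        return ()
    candidates = ((text[:i],) + segment(text[i:])
                  for i in range(1, min(len(text), 20) + 1))
    return max(candidates, key=pwords)

print(segment("makeamericagreatagain"))
print(segment.cache_info())  # hits show how often a suffix was reused
```

If the counts change after segment has been called, segment.cache_clear() discards the stale results.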

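Sometimes a word gets split into smaller tokens; in that case there is a good chance the word is not present in our WORDS dictionary.

In the narcisticidiots example above, the string was broken into 3 tokens because the token idiots is not in our WORDS.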

# Check for sample word 'idiots' 
if 'idiots' in WORDS: 
    print("YES") 
else: 
    print("NO")

You can add new user-defined words to WORDS.

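# ... earlier part of the script (imports, WORDS) as above ...
user_words = []
user_words.append('idiots')

WORDS += user_words
COUNTS = Counter(WORDS)
# ... rest of the script (pdist, P, segment, ...) as above ...
print(segment('narcisticidiots'))  # ['narcistic', 'idiots']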
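Adding words by hand does not help when an unknown word has no split into known words at all: every candidate then scores 0 and the result is arbitrary. One possible complement, shown here as a sketch and not as part of the answer's code, is to give unknown words a small probability that shrinks with their length. The penalty formula and the toy counts are assumptions chosen for illustration:

```python
from functools import lru_cache

# Toy counts, invented for illustration; 'trump' is deliberately missing.
COUNTS = {"is": 300, "a": 500, "moron": 5}
N = sum(COUNTS.values())

def p_raw(word):
    """Unsmoothed estimate: unknown words get probability 0."""
    return COUNTS.get(word, 0) / N

def p_smoothed(word):
    """Known words as before; unknown words get a small, length-penalized value."""
    if word in COUNTS:
        return COUNTS[word] / N
    return 10.0 / (N * 10 ** len(word))  # assumed penalty, not a fitted value

def product(nums):
    result = 1.0
    for x in nums:
        result *= x
    return result

@lru_cache(maxsize=None)
def segment(text, p_word):
    """Most probable segmentation of text under the word model p_word."""
    if not text:
        return ()
    candidates = ((text[:i],) + segment(text[i:], p_word)
                  for i in range(1, min(len(text), 20) + 1))
    return max(candidates, key=lambda words: product(p_word(w) for w in words))

print(segment("trumpisamoron", p_raw))       # unknown part falls apart
print(segment("trumpisamoron", p_smoothed))  # unknown part stays in one piece
```

Note that this does not replace adding 'idiots' to WORDS: as long as 'idiot' and 's' are both known, that split will still score higher than one unknown six-letter word.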

For a better solution than this, you can use bigrams/trigrams.

More examples at: Word Segmentation Task
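To give a rough idea of what bigrams buy you, the sketch below scores two candidate segmentations of a hypothetical input, 'choosespain', first with unigram counts only and then with bigram counts. All counts are invented, and the fallback to the unigram estimate for unseen pairs is a simplification, not a properly normalized model:

```python
# Toy counts, invented for illustration.
UNIGRAMS = {"choose": 50, "chooses": 40, "spain": 30, "pain": 60}
BIGRAMS = {("choose", "spain"): 8, ("chooses", "pain"): 1}
N = sum(UNIGRAMS.values())

def p_unigram(word):
    return UNIGRAMS.get(word, 0) / N

def p_bigram(word, prev):
    """P(word | prev), falling back to the unigram estimate for unseen pairs."""
    pair = (prev, word)
    if pair in BIGRAMS and prev in UNIGRAMS:
        return BIGRAMS[pair] / UNIGRAMS[prev]
    return p_unigram(word)

def score_unigram(words):
    score = 1.0
    for w in words:
        score *= p_unigram(w)
    return score

def score_bigram(words):
    score = 1.0
    prev = None
    for w in words:
        # The first word has no predecessor, so it uses the unigram estimate.
        score *= p_unigram(w) if prev is None else p_bigram(w, prev)
        prev = w
    return score

candidates = [["choose", "spain"], ["chooses", "pain"]]
print(max(candidates, key=score_unigram))  # ['chooses', 'pain']
print(max(candidates, key=score_bigram))   # ['choose', 'spain']
```

With these counts the unigram score prefers the split with the two more frequent words, while the bigram score prefers the pair that actually occurs together.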

2016-09-30 21:08:36 RAVI
