2016-07-27 47 views
I need to split #tags into meaningful words in an automated way, i.e. turn a hashtag into plain text.

Sample input:

  • iloveusa
  • mycrushlike
  • mydadhero

Sample output:

  • i love usa
  • my crush like
  • my dad hero

Is there any utility or open API that I can use to achieve this?

Possible duplicate of [Split words on boundary](http://stackoverflow.com/questions/39781936/split-words-on-boundary) – tripleee

Answers

Check out Norvig's work on the Word Segmentation Task.

from collections import Counter
from functools import lru_cache
import nltk

# Requires the Brown corpus; run nltk.download('brown') once beforehand.
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x] / N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L) + 1)]

@lru_cache(maxsize=None)  # memoize: without this the recursion is exponential
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    candidates = ([first] + segment(rest)
                  for (first, rest) in splits(text, 1))
    return max(candidates, key=Pwords)

print(segment('iloveusa'))     # ['i', 'love', 'us', 'a']
print(segment('mycrushlike'))  # ['my', 'crush', 'like']
print(segment('mydadhero'))    # ['my', 'dad', 'hero']
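
One limitation of the unigram model above: a word that never appears in the corpus gets probability 0, so every segmentation containing it scores 0. Below is a minimal sketch of one possible workaround, not part of the original answer. It assumes a tiny made-up `COUNTS` table in place of a real corpus and an assumed length-based penalty for unknown words, and it sums log-probabilities instead of multiplying so long inputs do not underflow.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy counts standing in for a real corpus.
COUNTS = Counter({'i': 50, 'love': 20, 'us': 15, 'usa': 5, 'a': 60,
                  'my': 40, 'dad': 8, 'hero': 4, 'crush': 3, 'like': 25})
N = sum(COUNTS.values())

def log_p(word):
    "Log10 probability of a word; unseen words get a length-based penalty."
    if word in COUNTS:
        return math.log10(COUNTS[word] / N)
    # Assumed heuristic: each extra character makes an unknown word 10x less likely.
    return math.log10(10.0 / (N * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    "Return (score, words) for the highest-scoring segmentation of text."
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), 20) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((log_p(first) + rest_score, (first,) + rest_words))
    return max(candidates)

score, words = segment('mydadhero')
print(' '.join(words))
```

The result depends entirely on the counts: with a real corpus the penalty constant would need tuning.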

To get a better solution than this, you can use bigrams/trigrams.
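
As a rough illustration of the bigram idea, here is a minimal sketch that scores each word given the previous one. The `UNIGRAMS` and `BIGRAMS` tables are made-up toy numbers, the `'<s>'` start marker and the function names are assumptions, and the backoff to the unigram estimate is deliberately crude (it is not a properly normalized model).

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy counts; real ones would come from a large corpus.
UNIGRAMS = Counter({'i': 50, 'love': 20, 'us': 15, 'usa': 5, 'a': 60})
BIGRAMS = Counter({('i', 'love'): 12, ('love', 'usa'): 3,
                   ('love', 'us'): 1, ('us', 'a'): 1})
N = sum(UNIGRAMS.values())

def log_p1(word):
    "Unigram log10 probability, with an assumed penalty for unseen words."
    if word in UNIGRAMS:
        return math.log10(UNIGRAMS[word] / N)
    return math.log10(10.0 / (N * 10 ** len(word)))

def log_p2(word, prev):
    "log10 P(word | prev); falls back to the unigram score for unseen pairs."
    if (prev, word) in BIGRAMS and prev in UNIGRAMS:
        return math.log10(BIGRAMS[(prev, word)] / UNIGRAMS[prev])
    return log_p1(word)

@lru_cache(maxsize=None)
def segment2(text, prev='<s>'):
    "Return (score, words) for the best segmentation of text following prev."
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), 20) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment2(rest, first)
        candidates.append((log_p2(first, prev) + rest_score,
                           (first,) + rest_words))
    return max(candidates)

print(' '.join(segment2('iloveusa')[1]))
```

With these toy counts the pair ('love', 'usa') is far more likely than ('love', 'us') followed by 'a', so the context pulls the split toward "usa"; a trigram version would extend the same pattern by conditioning on the two previous words.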

More examples at: Word Segmentation Task