Splitting words using the nltk module in Python

I am trying to find a way to split words in Python using the nltk module. Given the raw data I have, I am not sure how to reach my goal. As you can see, many words are stuck together (i.e. 'to' and 'produce' are stuck in one string, 'toproduce'). This is an artifact of scraping the data from a PDF file, and I would like to find a way to split the stuck-together words using the nltk module in Python (i.e. split 'toproduce' into two words: 'to' and 'produce'; split 'standardoperatingprocedures' into three words: 'standard', 'operating', 'procedures').

Any help is appreciated!

Answer


I believe you will want to use word segmentation in this case, and I am not aware of any word segmentation features in NLTK that will handle English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It works for a modest number of relatively short strings, such as the strings in your example list, but would be quite inefficient for longer strings or for very many strings.) It needs modification, and it will not successfully segment every string in every case.

import enchant  # pip install pyenchant
eng_dict = enchant.Dict("en_US")

def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns a list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
        (These have been found previously to lead to dead ends.)
        If an excluded word occurs later in the string, this
        function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through segments of the chars starting with the longest segment possible
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words
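As an aside, the core greedy longest-match idea can be sketched without pyenchant by swapping the `eng_dict.check()` call for membership in a plain set (the toy vocabulary below is invented purely for illustration; this simplified sketch omits the `exclude` backtracking):

```python
def segment_greedy(chars, vocab):
    # Greedy longest-match segmentation against a plain set of words
    # (the set membership test stands in for eng_dict.check()).
    words = []
    working = chars
    while working:
        for i in range(len(working), 0, -1):
            if working[:i] in vocab:
                words.append(working[:i])
                working = working[i:]
                break
        else:
            return [chars]  # give up: keep the whole string as one segment
    return words

# Hypothetical toy vocabulary, just for demonstration:
vocab = {"to", "produce", "standard", "operating", "procedures"}
print(segment_greedy("toproduce", vocab))                    # ['to', 'produce']
print(segment_greedy("standardoperatingprocedures", vocab))  # ['standard', 'operating', 'procedures']
```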

As you can see, this approach (presumably) mis-segments only one of your strings:

>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework'] 
>>> [segment_str(chars) for chars in t] 
"genotypes" not in dictionary (so just keeping as one segment)! 
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']] 

The list of lists can then be flattened using chain:

>>> from itertools import chain 
>>> list(chain.from_iterable(segment_str(chars) for chars in t)) 
"genotypes" not in dictionary (so just keeping as one segment)! 
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework'] 

Awesome, thanks! This is exactly what I was looking for. I thought this could be done with an nltk corpus, but I am happy to work with pyenchant! – Kookaburra


Hey, I know this answer is a bit old, but one thing to be wary of is a mutable `set()` default argument, which leads to some odd behavior if you try:

    In [6]: segment_str("tookapill")
    Out[6]: ['to', 'okapi', 'll']
    In [7]: segment_str("tookapillinibiza")
    "tookapillinibiza" not in dictionary (so just keeping as one segment)!
    Out[7]: ['tookapillinibiza']
    In [8]: segment_str("tookapill")
    "tookapill" not in dictionary (so just keeping as one segment)!
    Out[8]: ['tookapill']

I added a default of None and a check at the point of use: http://effbot.org/zone/default-values.htm –
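The pitfall that comment describes is Python's mutable-default-argument behavior: a default like `exclude=set()` is evaluated once, when the function is defined, so the same set object is shared across calls and excluded words leak from one call into the next. A minimal self-contained sketch of the difference (the function names here are invented for illustration, independent of pyenchant):

```python
def remember_bad(word, seen=set()):
    # BUG: the default set is created once at definition time,
    # so words added in one call are still present in the next.
    seen.add(word)
    return sorted(seen)

def remember_good(word, seen=None):
    # A fresh set is created on every call unless one is passed in.
    if seen is None:
        seen = set()
    seen.add(word)
    return sorted(seen)

print(remember_bad("tookapill"))   # ['tookapill']
print(remember_bad("ibiza"))       # ['ibiza', 'tookapill']  <- state leaked between calls
print(remember_good("tookapill"))  # ['tookapill']
print(remember_good("ibiza"))      # ['ibiza']
```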