Peter Norvig's word segmentation problem: how can I segment words that contain misspellings?

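I am trying to understand how Peter Norvig's spelling corrector works.

In his Jupyter notebook (titled here), he explains how to segment a sequence of characters that has no spaces separating the words. It works correctly when every word in the sequence is spelled correctly: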

>>> segment("deeplearning") 
['deep', 'learning']

But when a word (or several words) in the sequence is misspelled, it does not work correctly:

>>> segment("deeplerning") 
['deep', 'l', 'erning']
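
For context, the behaviour above can be reproduced with a minimal sketch of a unigram-based segmenter. This is not the notebook's exact code: the `COUNTS` table and its numbers are hypothetical toy values, and the unknown-word penalty is only assumed to follow the same length-based idea. Because the data differs, the split of the misspelled input differs from the one shown above, but the failure is the same: the misspelled part never becomes a real word.

```python
from functools import lru_cache
from math import log10

# Hypothetical toy unigram counts; the real notebook loads a large count file.
COUNTS = {"deep": 50, "learning": 40, "learn": 30}
TOTAL = sum(COUNTS.values())

def log_prob(word):
    """Log10 probability of a word; unknown words are penalised by their length."""
    if word in COUNTS:
        return log10(COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable split of `text` into words, as a tuple."""
    if not text:
        return ()
    # Try every possible first word and segment the remainder recursively.
    candidates = [(text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1)]
    return max(candidates, key=lambda ws: sum(log_prob(w) for w in ws))

print(list(segment("deeplearning")))  # ['deep', 'learning']
print(list(segment("deeplerning")))   # ['deep', 'lerning'] with this toy table
```

The segmenter only scores splits of the characters it is given, so it has no way to turn "lerning" into "learning"; that step needs a spelling corrector.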

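Unfortunately, I have no idea how to fix this so that the segment() function works on concatenated words that contain misspellings.

Does anyone have an idea how to deal with this problem?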

2017-08-02 Philip Marchenko

I mean... this is a hard problem. A lot of research goes into it. – erip

Do you know of any articles about research on this problem? –

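This can be achieved with minor changes to Peter Norvig's algorithm. The trick is to add a space character to the alphabet and to treat every bigram, joined by a space character, as a single unique word.

Since big.txt does not contain the bigram deep learning, we will have to add a little more text to our dictionary. I will use the wikipedia library (pip install wikipedia) to get more text.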

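import re
import wikipedia as wiki
import nltk
from nltk.tokenize import word_tokenize  # requires the NLTK tokenizer data

# Start from the unigrams in big.txt.
with open("big.txt") as f:
    unigrams = re.findall(r"\w+", f.read().lower())

# Add the text of the Wikipedia pages found by searching for "Deep Learning".
for title in wiki.search("Deep Learning"):
    try:
        page = wiki.page(title).content.lower()
        # Drop non-ASCII characters, then decode back to str for the tokenizer.
        page = page.encode("ascii", errors="ignore").decode("ascii")
        unigrams = unigrams + word_tokenize(page)
    except Exception:
        # Stop at the first page that fails to load or tokenize.
        break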

I will create a new dictionary containing all the unigrams and bigrams:

# Write every unigram, then every bigram, one entry per line.
with open("new_dict.txt", "w") as fo:
    for u in unigrams:
        fo.write(u + "\n")
    bigrams = list(nltk.bigrams(unigrams))
    for b in bigrams:
        fo.write(" ".join(b) + "\n")

Now just add a space character to the letters variable in the edits1 function, change big.txt to new_dict.txt, and change this function:

def words(text): return re.findall(r'\w+', text.lower())

to this:

def words(text): return text.split("\n")

Now correction("deeplerning") returns 'deep learning'!
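
Put together, the modified corrector might look like the sketch below. It follows the structure of Norvig's corrector with the changes described above, but it is only an illustration: `DICTIONARY_TEXT` is a hypothetical in-memory stand-in for the contents of new_dict.txt, and `words` additionally skips empty lines, which the one-line version above does not.

```python
from collections import Counter

# Hypothetical stand-in for new_dict.txt: one entry per line, where an entry
# is either a unigram or a space-joined bigram.
DICTIONARY_TEXT = "deep\nlearning\ndeep learning\nmachine\nlearning\nmachine learning\n"

def words(text):
    # Each line is one dictionary entry; empty lines are skipped.
    return [line for line in text.split("\n") if line]

WORDS = Counter(words(DICTIONARY_TEXT))
N = sum(WORDS.values())

def P(word):
    """Relative frequency of a dictionary entry."""
    return WORDS[word] / N

def known(candidates):
    """Subset of candidates that appear in the dictionary."""
    return {w for w in candidates if w in WORDS}

def edits1(word):
    """All strings one edit away; the alphabet now includes a space."""
    letters = "abcdefghijklmnopqrstuvwxyz "
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away, generated lazily."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def correction(word):
    """Most probable dictionary entry within two edits of `word`."""
    candidates = (known([word]) or known(edits1(word))
                  or known(edits2(word)) or {word})
    return max(candidates, key=P)

print(correction("deeplerning"))  # deep learning
```

Here "deeplerning" is two edits away from "deep learning": one inserted space and one inserted "a". That is why the bigram has to be in the dictionary as a single entry.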

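This trick will work well if you need a domain-specific spelling corrector. If the domain is large, you can try adding only the most common unigrams/bigrams to your dictionary.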
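
One way to keep only the most common entries is sketched below. The function name `top_entries` and its two size limits are hypothetical, not part of the original code; the result maps each kept entry to its count, so it could be used as the corrector's word-count table directly or written out to the dictionary file.

```python
from collections import Counter

def top_entries(tokens, max_unigrams, max_bigrams):
    """Count unigrams and space-joined bigrams, keeping only the most common."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    kept = Counter(dict(unigram_counts.most_common(max_unigrams)))
    kept.update(dict(bigram_counts.most_common(max_bigrams)))
    return kept

tokens = "deep learning is deep learning about deep networks".split()
print(top_entries(tokens, max_unigrams=2, max_bigrams=1))
```

Suitable limits depend on the size of the corpus and on how much memory the corrector may use, so they are left as parameters here.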

This question may also be helpful.

2017-08-04 08:32:40 Vlad
