2011-05-31 109 views
1

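Literature on spell checking?

I would like to know whether there is a list of literature on how to implement spell checking. The one example I could find, Peter Norvig's "How to Write a Spelling Corrector" (http://norvig.com/spell-correct.html), is very unrealistic.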

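A few things I am interested in:

  • Building a spell checker without relying on a dictionary (by using an existing corpus or n-gram dumps such as the Google n-gram dump).
  • Context-sensitive spell checking.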
+3

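What made you decide that Norvig's example is unrealistic? If you add an error model and compile it into a Levenshtein transducer, it should be a very good baseline spell checker. – 2011-05-31 17:35:08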

Answers

0

Quote below:

How does it Work? 
The Basic Model 
The basic technology works as follows: The documents that the search engine is providing access to are added both to the search index and a language model. The language model stores seen phrases and maintains statistics about them. When a query is submitted, the src/QuerySpellCheck.java class looks for possible character edits, namely substitutions, insertions, replacements, transpositions, and deletions, that make the query a better fit for the language model. So if you type 'Gretski' as a query, and the underlying data is data from rec.sport.hockey, the language model will be much more familiar with the mildly edited 'Gretzky' and suggests it as an alternative.
Domain Sensitivity 
The big advantage of this approach over dictionary-based spell checking is that the corrections are motivated by data in the search index. So "trt" will be corrected to "tort" in a legal domain, "tart" in a cooking domain, and "TRt" in a bio-informatics domain. On Google, there is no suggested correction, presumably because of web domains "trt.com", Thessaly Radio Television as well as Turkiye Radyo Televizyon, both aka TRT, etc. 
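(Note: the following is not LingPipe's code. It is a minimal Python sketch of the idea described above, in the spirit of Norvig's corrector: candidate character edits are generated for the query word and scored against word counts taken from the indexed text, so no separate dictionary is needed. The names `train`, `edits1`, and `correct` are hypothetical, the "language model" is assumed to be plain unigram counts, and only lowercase ASCII words are handled.)

```python
from collections import Counter
import re
import string


def train(text):
    """Count lowercase word tokens in the indexed text (a unigram model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one edit away: deletion, transposition, substitution, insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)


def correct(word, counts, max_edits=2):
    """Return the most frequent corpus word within max_edits of `word`.

    Words already seen in the corpus are returned unchanged; closer
    candidates are preferred over more distant ones (assumed error model).
    """
    word = word.lower()
    if word in counts:
        return word
    candidates = {word}
    for _ in range(max_edits):
        candidates = {e for c in candidates for e in edits1(c)}
        known = [c for c in candidates if c in counts]
        if known:
            # Highest corpus count wins; ties are broken alphabetically.
            return max(known, key=lambda c: (counts[c], c))
    return word


hockey = train("Wayne Gretzky scored again. Gretzky is great.")
legal = train("The tort claim was dismissed under tort law.")
cooking = train("Bake the tart, then glaze the tart with jam.")

print(correct("Gretski", hockey))
print(correct("trt", legal), correct("trt", cooking))
```

The same function gives different corrections for "trt" depending only on which text the counts were built from, which is the domain sensitivity described above. A realistic system would replace the raw counts with a smoothed language model and a weighted error model.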
Context-Sensitive Correction 
Both Yahoo and Google perform context-sensitive correction. For instance, the query frod (an Old English term from the German meaning wise or experienced) has a suggested correction of ford (the automotive company, among others), whereas the query frod baggins has the corrected query frodo baggins (a 20th century English fictional character). That's the Yahoo behavior. Google doesn't correct frod baggins, even though there are about 785 hits for it versus 820,000 for Frodo Baggins. On the other hand, Google does correct frdo and frdo baggins. Amazon behaves similarly, but MSN corrects frd baggins to ford baggins rather than frodo baggins. 
LingPipe's model supports exactly this kind of context-sensitive correction.
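(Note: again, this is not LingPipe's implementation. It is a rough Python sketch of how context can change a correction, assuming a word-bigram model with add-one smoothing and a fixed per-edit penalty. All function names, the toy corpus, and the `edit_penalty` value are hypothetical choices made for illustration.)

```python
from collections import Counter
from itertools import product
import math
import re
import string


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(text):
    """Return (unigram counts, bigram counts) for the indexed text."""
    words = tokenize(text)
    return Counter(words), Counter(zip(words, words[1:]))


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)


def candidates(word, unigrams):
    """Map each candidate to its edit count: the word itself (0) or a known neighbour (1)."""
    cands = {word: 0}
    for e in edits1(word):
        if e in unigrams and e != word:
            cands[e] = 1
    return cands


def score(words, unigrams, bigrams):
    """Add-one smoothed log-probability of a word sequence under the bigram model."""
    v = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    total = sum(unigrams.values())
    logp = math.log((unigrams[words[0]] + 1) / (total + v))
    for prev, cur in zip(words, words[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + v))
    return logp


def correct_query(query, unigrams, bigrams, edit_penalty=1.0):
    """Pick the candidate sequence with the best model score minus edit cost."""
    per_word = [candidates(w, unigrams) for w in tokenize(query)]
    if not per_word:
        return query
    best, best_score = None, -math.inf
    for combo in product(*per_word):
        n_edits = sum(c[w] for c, w in zip(per_word, combo))
        s = score(combo, unigrams, bigrams) - edit_penalty * n_edits
        if s > best_score:
            best, best_score = combo, s
    return " ".join(best)


corpus = ("ford makes cars. ford makes trucks. ford sells cars. "
          "frodo baggins left the shire.")
unigrams, bigrams = train(corpus)

print(correct_query("frod", unigrams, bigrams))          # ford
print(correct_query("frod baggins", unigrams, bigrams))  # frodo baggins
```

On its own, "frod" is pulled toward the more frequent "ford"; followed by "baggins", the bigram count for ("frodo", "baggins") outweighs that frequency advantage. Enumerating every combination is exponential in query length, so a real system would use a beam or Viterbi-style search instead.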

Read this great tutorial.

+0

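While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Craigy 2012-08-09 17:23:46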

+0

Sure, I have copied the most important passage. – yura 2012-08-10 05:06:04