匹配包含單詞排列的行

假設您有一個包含varchar列的大表。匹配包含單詞排列的行

你會如何匹配包含在VARCHAR山坳的「首選」，但數據是有點吵，包含偶爾拼寫錯誤，例如，字行：

['$2.10 Cumulative Convertible Preffered Stock, $25 par value', 
'5.95% Preferres Stock', 
'Class A Preffered', 
'Series A Peferred Shares', 
'Series A Perferred Shares', 
'Series A Prefered Stock', 
'Series A Preffered Stock', 
'Perfered', 
'Preffered C']

字的排列在「優選」上面的拼寫錯誤似乎表現爲family resemblance，但它們幾乎沒有什麼共同之處。請注意，拆分每個單詞並在每行中的每個單詞上運行levenshtein將會非常昂貴。

UPDATE：

有幾個這樣的，例如，其它實施例與「限制」：

['Resticted Stock Plan', 
'resticted securities', 
'Ristricted Common Stock', 
'Common stock (restrticted, subject to vesting)', 
'Common Stock (Retricted)', 
'Restircted Stock Award', 
'Restriced Common Stock',]

來源

2009-04-08 ʞɔıu

您是否具體詢問「首選」，或者這是一個普遍問題？ – 2009-04-08 22:27:09

這裏有一小部分其他示例 – 2009-04-08 22:29:38

你能試着訓練它放在你的桌子的小樣本找到潛在的錯誤拼寫（採用分體式+萊文斯坦），然後使用在全表所產生的單詞列表？

來源

2009-04-08 22:18:23

正試圖在TSQL或什麼語言做到這一點？

您可能能夠用正則表達式擊中大多數人。

你想知道那不是蓋敏感以下

"p(er|re|e)f{1,2}er{1,2}ed" 

"r(e|i)s?t(ri|ir|rti|i)ct?ed"

的一些變化......

來源

2009-04-08 22:22:54 bytebender

創建兩個表，拼寫和可能speelings的：

- - 你可以找出類型

create table spelling (id, word) ; 
create table possible_spelling 
(id, spelling_id references spelling(id), spelling) 
-- possible spelling also includes the correct spelling 
-- all values are lowercase 

insert into spelling(word) values ('preferred'); 
insert into possible_spelling(spelling_id, spelling) 
select 1, '%preferred%' union select 1, '%prefered%' union ....; 

select * 
from bigtable a 
join possible_spelling b 
on (lower(a.data) like b.spelling) 
join spelling c on (b.spelling_id = c.id) 
where c.word = 'preferred';

異議：thi會很慢，需要設置。答：不那麼慢，這應該是一次性分類和修復你的數據。一次設置，每傳入一行進行分類一次。

來源

2009-04-08 22:31:53 tpdi

我可能會做這樣的事情 - 如果你能與萊文斯坦逃脫一次 - 這裏是an amazing spellchecker implementation by Peter Norvig：

import re, collections 

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features): 
    model = collections.defaultdict(lambda: 1) 
    for f in features: 
     model[f] += 1 
    return model 

NWORDS = train(words(file('big.txt').read())) 

alphabet = 'abcdefghijklmnopqrstuvwxyz' 

def edits1(word): 
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)] 
    deletes = [a + b[1:] for a, b in s if b] 
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1] 
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b] 
    inserts = [a + c + b  for a, b in s for c in alphabet] 
    return set(deletes + transposes + replaces + inserts) 

def known_edits2(word): 
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS) 

def known(words): return set(w for w in words if w in NWORDS) 

def correct(word): 
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] 
    return max(candidates, key=NWORDS.get)

他做了個訓練組可用here: http://norvig.com/big.txt下面是示例輸出：

>>> correct('prefferred') 
'preferred' 
>>> correct('ristricted') 
'restricted' 
>>> correct('ristrickted') 
'restricted'

對於您的情況，您可以將原始列複製到新列，但是當您這樣做時將其傳遞給拼寫檢查程序。然後在正確拼寫的列上放置一個fulltext索引，並將其與您的查詢進行匹配，但返回原始列的結果。你只需要做一次，而不是每次計算距離。您也可以對輸入進行拼寫檢查，或者僅將更正後的版本作爲後備檢查。無論哪種方式，都值得研究Norvig的例子。

來源

2009-04-10 07:59:34 unmounted

匹配包含單詞排列的行

回答

相關問題