Detecting English words in text

I have a scraped dataset, but it also contains entries that are full of junk, for example:

Name: sdfsdfsdfsd 
Location: asdfdgdfjkgdsfjs 
Education: Science & Literature

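It is currently stored in MySQL and Solr.
Is there a library that can look for English words in these fields so that I can eliminate the junk values? I believe this needs a dictionary, and the default Unix dictionary in /usr/share/dict/ seems sufficient for this use case.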

2016-05-29 Yashveer Rana

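# Load the system word list once into a set for fast membership tests.
with open('/usr/share/dict/words') as f:
    words = set(word.lower() for word in f.read().split()
                # Really short words aren't much of an indication
                if len(word) > 3)

def is_english(text):
    # True if at least one whitespace-separated token is a dictionary word.
    # Equivalent: any(word in words for word in text.lower().split())
    return bool(words.intersection(text.lower().split()))

# 'cat' has only 3 letters, so it was filtered out of `words` above.
print(is_english('usfdbg dsuyfbg cat'))
print(is_english('Science & Literature'))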
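
To flag individual fields of a record, as in the question, the same idea could be extended along these lines. This is only a sketch under assumptions: the helper names `english_ratio` and `junk_fields`, the 0.5 threshold, and the tiny stand-in vocabulary are all hypothetical and not part of the original answer. In practice the vocabulary would be the set built from the dictionary file.

```python
import re

def english_ratio(text, vocabulary):
    """Return the fraction of alphabetic tokens in `text` found in `vocabulary`."""
    # Extract runs of letters so punctuation such as '&' or ',' is ignored.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)

def junk_fields(record, vocabulary, threshold=0.5):
    """Return the names of fields whose share of known words is below `threshold`."""
    return [name for name, value in record.items()
            if english_ratio(value, vocabulary) < threshold]

# Stand-in vocabulary (assumption); replace with the set loaded from the word list.
vocabulary = {"science", "literature"}
record = {
    "Name": "sdfsdfsdfsd",
    "Location": "asdfdgdfjkgdsfjs",
    "Education": "Science & Literature",
}
print(junk_fields(record, vocabulary))  # ['Name', 'Location']
```

A ratio is used here instead of "at least one match" so that one accidental hit in a long junk string does not rescue the field. Real names and place names are often missing from a general dictionary, so fields like Name may need a separate word list or a looser threshold.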

2016-05-29 11:11:31

That implies O(n^2) complexity, because I would have to scan the entire list for every row in my dataset.

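@YashveerRana No, the point of a set is constant-time lookup (on average) for each item. 'is_english' is O(n), where n is the number of words in 'text', and you can't do better than that.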
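
The point about lookup cost can be illustrated by counting membership tests directly. The following is a rough sketch, not the answerer's code: `filter_english_rows` and the sample rows are made up for illustration. It shows that the number of lookups depends only on the words in the rows, not on how large the dictionary set is.

```python
def filter_english_rows(rows, vocabulary):
    """Keep rows containing at least one known word; also count set lookups."""
    kept = []
    lookups = 0
    for row in rows:
        for word in row.lower().split():
            lookups += 1  # one hash-based membership test, O(1) on average
            if word in vocabulary:
                kept.append(row)
                break  # one hit is enough for this row
    return kept, lookups

rows = ["usfdbg dsuyfbg cat", "sdfsdfsdfsd", "Science & Literature"]
small = {"cat", "science", "literature"}
# Pad the vocabulary with 100,000 extra entries that never match.
large = small | {f"w{i}" for i in range(100_000)}

print(filter_english_rows(rows, small))
print(filter_english_rows(rows, large))
```

Both calls perform the same number of lookups, so the total work grows with the number of words across all rows rather than with rows times dictionary size.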
