Fuzzy text search in Python

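I have a large sample of text, for example:

"The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is."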

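I am trying to detect whether "engage the prognosis for survival" occurs in the text, but in a fuzzy way. For example, "has engage the progronosis of survival" must also return a positive answer.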

I looked at fuzzywuzzy, NLTK, and the new fuzzy-matching features of the regex module, but I did not find a way to do:

if [anything similar (>90%) to "that sentence"] in mybigtext: 
    print True

Asked 2016-02-29 by Mickael_Paris

I'm new here, but I think this should solve your problem: http://stackoverflow.com/questions/30449452/python-fuzzy-text-search?rq=1

Take a look at [gensim](https://radimrehurek.com/gensim/index.html), in particular the [similarity section](https://radimrehurek.com/gensim/tut3.html). – Jan

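Below is a function that reports a match whenever a word of the phrase is contained in the text. You could build on it to check for complete phrases in the text.

Here is the function I came up with: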

def FuzzySearch(text, phrase):
    """Print a message for each word of phrase that occurs in text."""
    # split() without arguments ignores repeated whitespace, so no empty "words"
    for word in phrase.split():
        # Substring test: a word also matches when it occurs inside a longer word
        if word in text:
            print("Match! Found " + word + " in text")
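One way to take the "complete phrases" suggestion further is to check that the words of the phrase occur consecutively and in order. The sketch below is only an illustration: `contains_phrase` and `_words` are hypothetical names, not part of the answer above, and the check is exact apart from ignoring case and punctuation, so it is not fuzzy yet.

```python
import re


def _words(s):
    """Lower-cased word tokens, with punctuation dropped (hypothetical helper)."""
    return re.findall(r"\w+", s.lower())


def contains_phrase(text, phrase):
    """True if the words of `phrase` appear consecutively, in order, in `text`."""
    words, target = _words(text), _words(phrase)
    if not target:
        return False
    # Compare the phrase against every window of the same length in the text
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )
```

For instance, `contains_phrase("It may engage the prognosis for survival.", "engage the prognosis")` is true, while the same words in a different order are rejected.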

Answered 2016-02-29 17:52:16

Yes, that was my first guess, but it gives no way to make the match fuzzy at the sentence level...

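The following is not ideal, but it should get you started. It first uses nltk to split the text into words, then builds a set containing the stems of all the words, filtering out any stop words. It does this for both your example text and your example query.

If the intersection of the two sets contains all the words in the query, it is considered a match.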

import nltk 

from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords 

stop_words = stopwords.words('english') 
ps = PorterStemmer() 

def get_word_set(text): 
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words) 

text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous." 
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous." 

query = "engage the prognosis for survival" 

set_query = get_word_set(query) 
for text in [text1, text2]: 
    set_text = get_word_set(text) 
    intersection = set_query & set_text 

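    # Single-argument print(...) calls run under both Python 2 and Python 3
    print("Query: {}".format(set_query))
    print("Test: {}".format(set_text))
    print("Intersection: {}".format(intersection))
    # A match requires every query stem to be present in the text
    print("Match: {}".format(len(intersection) == len(set_query)))
    print("")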

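The script is run on two texts, one that passes and one that does not, and it produces the following output (shown as printed under Python 2) to show you what it is doing: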

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'prognosi', u'engag', u'surviv']) 
Match: True 

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'engag', u'surviv']) 
Match: False
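If requiring every query word is too strict, the same set idea can be relaxed to a coverage ratio. This is a rough standard-library sketch, not the answer's code: it does no stemming, `STOP_WORDS` is a tiny illustrative list rather than NLTK's, and `word_set` and `query_coverage` are hypothetical names.

```python
import re

# Illustrative stop-word list only; the answer above uses NLTK's English list.
STOP_WORDS = {"the", "for", "of", "a", "as", "is", "may"}


def word_set(text):
    """Lower-cased alphabetic tokens minus stop words (no stemming here)."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def query_coverage(query, text):
    """Fraction of the query's words that also occur in the text."""
    query_words = word_set(query)
    if not query_words:
        return 0.0
    return len(query_words & word_set(text)) / len(query_words)
```

A caller could then accept a text when `query_coverage(query, text)` reaches some chosen threshold; the right threshold depends on the data and would need tuning.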

Answered 2016-02-29 20:32:08

Yes, I had thought about that possibility! If I really cannot find any other solution, I will use that one! Thanks!

Using the regex module, first split the text into sentences, then test whether the fuzzy pattern is in each sentence:

import regex  # third-party module (pip install regex), not the stdlib re

tgt = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

# Error limit is len(pat)/5; note that len(pat) counts the whole pattern string,
# not just the phrase, which yields {e<10} here
pat = r'(?e)((?:has engage the progronosis of survival){e<%i})'
pat = pat % int(len(pat) / 5)

# Split into sentences: break after . ? ! ; when an uppercase letter follows
for sentence in regex.split(r'(?<=[.?!;])\s+(?=\p{Lu})', tgt):
    m = regex.search(pat, sentence)
    if m:
        # fuzzy_counts is (substitutions, insertions, deletions)
        print("'{}'\n\tfuzzy matches\n'{}'\n\twith \n{} substitutions, {} insertions, {} deletions".format(pat, m.group(1), *m.fuzzy_counts))

This prints:

'(?e)((?:has engage the progronosis of survival){e<10})' 
    fuzzy matches 
'may engage the prognosis for survival' 
    with 
3 substitutions, 1 insertions, 2 deletions
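To see what an error limit like `{e<10}` is bounding, here is a small pure-Python sketch that computes the smallest number of single-character edits needed to turn the phrase into some substring of the text. It is a textbook dynamic-programming approach to approximate substring matching, not how the regex module works internally, and `min_edit_distance_in_text` is a hypothetical name.

```python
def min_edit_distance_in_text(phrase, text):
    """Smallest Levenshtein distance between `phrase` and any substring of `text`."""
    # prev[j]: cheapest way to match the phrase prefix seen so far against
    # some substring of text ending at position j. An empty prefix costs 0
    # everywhere, which lets the match start at any position in the text.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(phrase, start=1):
        cur = [i] + [0] * len(text)  # i phrase characters against nothing
        for j, tc in enumerate(text, start=1):
            cur[j] = min(
                prev[j] + 1,               # phrase character missing from text
                cur[j - 1] + 1,            # extra character in text
                prev[j - 1] + (pc != tc),  # same character, or a substitution
            )
        prev = cur
    # The match may end anywhere in the text
    return min(prev)
```

A caller can compare the result with a limit of their choosing. For example, `"has engage the prognosis"` is found with 0 edits in a text containing it verbatim, but needs 3 edits against `"he has not engage the prognosis now"`, so a limit below 3 separates the two.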

Answered 2016-02-29 21:41:19 by dawg

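So by playing with the fuzzy error counts, for example by limiting them, I can tell the difference between 'has engage the prognosis' and 'not engage the prognosis'. This seems perfect, thank you! If that is the case, I will try to solve my problem this way.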
