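I'm going to assume you're completely lost here and go step by step, so that you can figure out how to do this yourself in the future. That said, it sounds like you need to work through a few Python tutorials. All of this is untested code.
Get the queries into a reasonable format: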
query_string = query_array[1:-1] #remove the parentheses with slicing
queries_with_whitespace = query_string.split(",") #split the string into a list
queries = [query.strip() for query in queries_with_whitespace] #remove whitespace
# queries = [item.strip() for item in query_array[1:-1].split(",")] #all in one
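As a quick check of the slicing and splitting above, here is a small sketch. The sample value of `query_array` is made up for illustration, on the assumption that the real data is a single parenthesized, comma-separated string.

```python
# Hypothetical sample input; the real query_array may look different.
query_array = "(president, chairman ,  leader)"

# Drop the surrounding parentheses, split on commas, trim whitespace.
queries = [item.strip() for item in query_array[1:-1].split(",")]

print(queries)
```

Note that `[1:-1]` blindly drops the first and last characters, so it only behaves as intended if the string really starts with `(` and ends with `)`.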
The same goes for the synonyms. Here it is for one of your stem strings:
def stem_and_syns(unformatted_string): #unformatted_string is your stem_array
    stem_string = unformatted_string[1:-1] #same as before
    stem, synonyms_string = stem_string.split("|") #split the stem and synonyms
    stem = stem.strip() #clean the stem
    synonyms = [synonym.strip() for synonym in synonyms_string.split(",")] #same as before
    return stem, synonyms
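However, you need the synonyms for a reverse lookup. Do you realize that any given word might be a stem as well as a synonym? And that any one word can have multiple stems? You need to figure out what to do in those cases. Anyway, here is the reverse lookup: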
stem_lookup = {}
for stem_string in stem_strings: #stem_strings is the set of all of your non-formatted stem strings
    stem, synonyms = stem_and_syns(stem_string)
    for synonym in synonyms:
        #point all synonyms to a list of possible stems
        stem_lookup.setdefault(synonym, []).append(stem) #make a new list if this synonym not used yet
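Finally, look up the stems for the original queries (again, this relies on assumptions that are convenient for me but may not fit your needs):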
result = [stem_lookup.get(original, [original]) for original in queries] #list of possible stems per query; falls back to [original] if it's not a synonym
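To see how the pieces fit together, and what the multiple-stems caveat looks like in practice, here is a self-contained sketch of the whole flow. The sample `stem_strings` and `query_array` values are invented, and the `(stem | synonym, synonym)` layout is assumed from the description. The fallback for unknown words is wrapped in a list so that every entry in `result` has the same type, which is a choice made for this sketch.

```python
def stem_and_syns(unformatted_string):
    """Split "(stem | syn1, syn2)" into the stem and a list of synonyms."""
    stem, synonyms_string = unformatted_string[1:-1].split("|")
    synonyms = [s.strip() for s in synonyms_string.split(",")]
    return stem.strip(), synonyms


# Hypothetical sample data in the "(stem | synonym, ...)" format.
stem_strings = [
    "(president | presidential, presidency)",
    "(preside | presidential)",  # "presidential" now has two possible stems
]
query_array = "(presidential, presidency, senator)"

# Reverse lookup: synonym -> list of every stem it appears under.
stem_lookup = {}
for stem_string in stem_strings:
    stem, synonyms = stem_and_syns(stem_string)
    for synonym in synonyms:
        stem_lookup.setdefault(synonym, []).append(stem)

queries = [q.strip() for q in query_array[1:-1].split(",")]

# Unknown words fall back to a one-element list holding the word itself.
result = [stem_lookup.get(q, [q]) for q in queries]
print(result)
```

With this data, "presidential" resolves to two candidate stems, so you still have to decide how to pick one (first match, all matches, or something else).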
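Is your data literally in this format? For example, a string "(president | presidential)" – KobeJohn 2012-01-31 00:58:50
Paste the actual code. The "arrays" you describe don't make sense. – JBernardo 2012-01-31 00:59:38
@yakiimo: Yes. It is literally in this format. – Nerd 2012-01-31 01:00:06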