如何將給定的文本分解爲字典中的單詞？

這是一個面試問題。假設你有一個字符串text和一個dictionary（一組字符串）。如何將text分解爲子字符串，以便在dictionary中找到每個子字符串。如何將給定的文本分解爲字典中的單詞？

例如，您可以使用/usr/share/dict/words將"thisisatext"分解爲["this", "is", "a", "text"]。

我相信回溯可以解決這個問題（在僞的Java）：

 
void solve(String s, Set<String> dict, List<String> solution) { 
    if (s.length == 0) 
     return 
    for each prefix of s found in dict 
     solve(s without prefix, dict, solution + prefix) 
} 

List<String> solution = new List<String>() 
solve(text, dict, solution)

是否有意義？你會優化在詞典中搜索前綴的步驟嗎？你會推薦什麼樣的數據結構？

來源

2012-01-09 Michael

糾正我，如果我錯了，但你的解決方案是非多項式。使用trie和DP最多可以解決O（n^2）（實際上是O（k），其中k是字典中最長單詞的長度）。讓我知道你是否需要答案。 – ElKamina 2012-01-09 21:05:46

@ElKamina謝謝。我想聽取DP解決方案 – Michael 2012-01-09 22:38:19

該解決方案假設字典的Trie數據結構存在。此外，對於Trie中的每個節點，假定以下功能：

節點。IsWord（）：如果路徑到該節點就是一個字
node.IsChild（字符x）返回真：返回子：如存在與標籤子X
node.GetChild（字符x）返回真與標籤節點x

Function annotate(String str, int start, int end, int root[], TrieNode node): 
i = start 
while i<=end: 
    if node.IsChild (str[i]): 
     node = node.GetChild(str[i]) 
     if node.IsWord(): 
      root[i+1] = start 
     i+=1 
    else: 
     break; 

end = len(str)-1 
root = [-1 for i in range(len(str)+1)] 
for start= 0:end: 
    if start = 0 or root[start]>=0: 
     annotate(str, start, end, root, trieRoot) 

index 0 1 2 3 4 5 6 7 8 9 10 11 
str: t h i s i s a t e x t 
root: -1 -1 -1 -1 0 -1 4 6 -1 6 -1 7

我會留下一部分給你列出彌補通過反向遍歷根弦的話。

時間複雜度爲O（nk）其中n是字符串的長度，k是字典中最長單詞的長度。 PS：我假設在字典中有以下單詞：這，是，一個，文本，吃了。

來源

2012-01-09 23:43:52 ElKamina

根不需要是一個列表數組嗎？否則，你將失去通過字符串聚集在同一地點的多個路徑 – 2012-01-09 23:56:07

否則，很好的解決方案:) – 2012-01-09 23:56:46

@TimothyJones我認爲海報想要一個解決方案，而不是所有的解決方案。你是對的，通過列表你可以打印所有組成字符串的單詞組合。 – ElKamina 2012-01-09 23:59:53

方法1- Trie看起來很適合這裏。在英文字典中生成單詞的字典。這棟建築是一次性成本。之後，建立你的string可以很容易地比較字母。如果在任何時候遇到樹中的葉子，你可以假設你找到了一個單詞，將它添加到列表中&用你的遍歷移動。遍歷直到到達string的末尾。該列表是輸出。

搜索的時間複雜度 - O（word_length）。

空間複雜度 - O（charsize * word_length * no_words）。字典的大小。

方法2 -我聽說Suffix Trees，從來沒有使用它們，但它可能在這裏很有用。

方法3 -更迂腐&一個糟糕的選擇。你已經提出了這個建議。

你可以嘗試另一種方式。通過dict運行檢查是否匹配子字符串。在這裏，我假設dict中的密鑰是英文字典/usr/share/dict/words的words。所以僞代碼看起來是這樣的 -

(list) splitIntoWords(String str, dict d) 
{ 
    words = [] 
    for (word in d) 
    { 
     if word in str 
      words.append(word); 
    } 
    return words; 
}

的複雜性 - 運行O（n）到（1）串匹配整個字典+ O。

空間 - 最壞的情況下爲O（n），如果len(words) == len(dict)

正如其他人所指出的那樣，這確實需要回溯。

來源

2012-01-09 18:46:15

您仍然必須處理回溯，對吧？如果你的字典同時包含「the」和「these」，那麼輸入「thesebugs」和「themts」將會導致問題。 – 2012-01-09 21:11:46

這似乎只能找到字符串中出現的那些單詞。問題中還有一個額外的條件 - 單詞必須覆蓋整個字符串而不重疊。 – 2012-01-09 22:23:28

我不認爲O（1）查找對於trie是正確的。 – 2012-01-09 23:51:46

有對該blog post.

的基本思想是隻爲了memoize的你寫的功能，你就會有一個O解決這個問題的一個非常徹底的書面記錄（N^2）時間， O（n）空間算法。

來源

2012-01-09 22:37:51

+1好的答案，附加評論幾種方法以及各種候選人如何迴應。正如博主所言，如果有人無法勝任這個玩具問題，他們在大規模的信息檢索和自然語言處理方面將會非常困難。 – Iterator 2012-01-10 05:01:21

您可以使用Dynamic Programming和Hashing來解決此問題。

計算字典中每個單詞的散列。使用你最喜歡的散列函數。我會用（a1 * B ^（n-1）+ a2 * B ^（n-2）+ ... + an * B^0）％P，其中a1a2 ... an是一個字符串，n是字符串的長度，B是多項式的基數，P是大質數。如果你有一個字符串a1a2 ... a的散列值，你可以在常量時間內計算字符串a1a2 ... ana（n + 1）的散列值：（hashValue（a1a2 ... an）* B + a （n + 1））％P.

這部分的複雜度是O（N * M），其中N是字典中的單詞數，M是字典中最長單詞的長度。

然後，使用一個DP函數是這樣的：

bool vis[LENGHT_OF_STRING]; 
    bool go(char str[], int length, int position) 
    { 
     int i; 

     // You found a set of words that can solve your task. 
     if (position == length) { 
      return true; 
     } 

     // You already have visited this position. You haven't had luck before, and obviously you won't have luck this time. 
     if (vis[position]) { 
     return false; 
     } 
     // Mark this position as visited. 
     vis[position] = true; 

     // A possible improvement is to stop this loop when the length of substring(position, i) is greater than the length of the longest word in the dictionary. 
     for (i = position; position < length; i++) { 
     // Calculate the hash value of the substring str(position, i); 
     if (hashValue is in dict) { 
      // You can partition the substring str(i + 1, length) in a set of words in the dictionary. 
      if (go(i + 1)) { 
       // Use the corresponding word for hashValue in the given position and return true because you found a partition for the substring str(position, length). 
       return true; 
      } 
     } 
     } 

     return false; 
    }

的該算法的複雜度是O（N * M），其中N是串的長度，M是最長字的長度在字典或O（N^2）中，取決於你是否編碼了改進。因此算法的總複雜度爲：O（N1 * M）+ O（N2 * M）（或O（N2^2）），其中N1是詞典中詞的數量，M是字典中最長單詞的長度，N2是字符串的長度）。

如果你不能想到一個很好的散列函數（其中沒有任何衝突），其他可能的解決方案是使用Tries或Patricia trie（如果常規trie的大小非常大）（我不能不要發佈這些主題的鏈接，因爲我的聲望不夠高，無法發佈超過2個鏈接）。但是在你使用它的時候，你的算法的複雜度是O（N * M）* O（在trie中找到一個單詞所需的時間），其中N是字符串的長度，M是最長單詞的長度在字典裏。

我希望它有幫助，我爲我可憐的英語道歉。

來源

2012-01-11 07:19:18

如何將給定的文本分解爲字典中的單詞？

回答

相關問題