How to find occurrences of compound words in a text

I am trying to find occurrences of specific words or compound words in a text.

For example, the text is "Happy birthday to you" and the phrase I have to match is "Happy birthday".

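I have a dictionary of words/phrases that needs to be matched against the input text. The dictionary contains about 3000 words/compound words. The amount of text to be analyzed may vary. Right now I am using a regular expression: \b + phrase + \b. This gives me the correct answers, but it is slow.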

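In addition, the words found in the text may be preceded or followed by special characters such as !, :, . and so on.

Although text.contains() is fast, I cannot use it because it returns true even when the phrase is only part of a longer word. Is there a faster way to do this?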
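To make the difference between the two checks concrete, here is a minimal sketch. The class name BoundaryDemo and both helper methods are made up for illustration, and the regex form is only an assumption about what the \b + phrase + \b approach looks like in code.

```java
import java.util.regex.Pattern;

class BoundaryDemo {

    // Substring test: true whenever the characters appear anywhere in the text.
    static boolean containsSubstring(String text, String phrase) {
        return text.contains(phrase);
    }

    // Whole-phrase test: \b requires a word boundary on both sides of the phrase.
    // Pattern.quote keeps any regex metacharacters in the phrase literal.
    static boolean containsWholePhrase(String text, String phrase) {
        return Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b")
                      .matcher(text)
                      .find();
    }

    public static void main(String[] args) {
        System.out.println(containsSubstring("the authority", "author"));   // true
        System.out.println(containsWholePhrase("the authority", "author")); // false
        System.out.println(containsWholePhrase("Happy birthday to you!", "Happy birthday")); // true
    }
}
```

Compiling a new Pattern for each of the roughly 3000 phrases on every text is a plausible source of the slowness, though that has not been measured here.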

2013-04-08 Tazo

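Why can't you use text.contains()? What do you mean by a subset of the word? – Howard 2013-04-08 09:34:32

Where do you store the dictionary? – 2013-04-08 09:35:30

For example, if the word I want to find is "author", then contains() returns true even for "authority", which is wrong. – Tazo 2013-04-08 09:36:08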

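You can split the string into an array of words and use the Knuth-Morris-Pratt algorithm, but instead of comparing the characters of a string, you compare the words in an array.

For example, the string: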

i bought a hat in manhattan

is split into the array:

S = {"i","bought","a","hat","in","manhattan"}

If you are looking for a single word, simply compare it with each word in this array.
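A minimal sketch of the single-word case, assuming the text is split on whitespace only (the class and method names are hypothetical):

```java
import java.util.Arrays;

class SingleWordLookup {

    // Splits on whitespace and compares whole tokens with equals(),
    // so "hat" matches the token "hat" but not the token "manhattan".
    static boolean containsWord(String text, String word) {
        String[] words = text.trim().split("\\s+");
        return Arrays.asList(words).contains(word);
    }

    public static void main(String[] args) {
        String text = "i bought a hat in manhattan";
        System.out.println(containsWord(text, "hat")); // true
        System.out.println(containsWord(text, "man")); // false
    }
}
```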

If you are looking for a sequence of words, for example:

W = {"a","hat","in"}

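use KMP. To be explicit: referring to the algorithm as defined on Wikipedia, with S and W as above, where the algorithm states if W[i] = S[m + i], you would implement this in Java as: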

if(W[i].equals(S[m+i]))
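Putting the pieces together, a word-level KMP search might look like the sketch below. It uses the prefix-function formulation of KMP rather than the m/i indexing of the Wikipedia pseudocode. The class name WordKmp, the tokenize helper, the lowercasing, and the choice to split on every non-letter, non-digit character are all assumptions added for illustration.

```java
import java.util.Arrays;

class WordKmp {

    // Lowercases the text and splits it on runs of characters
    // that are neither letters nor digits, dropping empty tokens.
    static String[] tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                     .filter(t -> !t.isEmpty())
                     .toArray(String[]::new);
    }

    // Returns the position in s of the first word of the first
    // occurrence of the word sequence w, or -1 if w does not occur.
    static int indexOf(String[] s, String[] w) {
        if (w.length == 0) return 0;

        // fail[i] = length of the longest proper prefix of w[0..i]
        // that is also a suffix of w[0..i], measured in words.
        int[] fail = new int[w.length];
        for (int i = 1, k = 0; i < w.length; i++) {
            while (k > 0 && !w[i].equals(w[k])) k = fail[k - 1];
            if (w[i].equals(w[k])) k++;
            fail[i] = k;
        }

        // Scan s once; k is the number of words of w matched so far.
        for (int i = 0, k = 0; i < s.length; i++) {
            while (k > 0 && !s[i].equals(w[k])) k = fail[k - 1];
            if (s[i].equals(w[k])) k++;
            if (k == w.length) return i - w.length + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] s = tokenize("I bought a hat in Manhattan!");
        System.out.println(indexOf(s, tokenize("a hat in"))); // 2
        System.out.println(indexOf(s, tokenize("hat")));      // 3
        System.out.println(indexOf(s, tokenize("man")));      // -1
    }
}
```

Because whole words are compared with equals(), "hat" can never match inside "manhattan", and punctuation around a word is discarded by the tokenizer before the search starts.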

2013-04-08 10:16:58

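Hey, I am trying this with KMP and Boyer-Moore. However, both algorithms return true even when the word appears together with other letters. E.g., say I want to find "hat" and the text contains man"hat"tan, then the algorithm returns true. Any clue how to handle this without using regex? – Tazo 2013-04-18 05:40:31

The key is to first split the string into an array of words; I will add an example. – 2013-04-19 06:17:31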

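Try this: (" " + test + " ").contains(" " + phrase + " ");

This should cover three cases:

When the test string starts or ends with the phrase, contains will still find it. When the phrase is in the middle, it will find it. When the phrase itself contains spaces, we are still fine...

I can't think of any other cases...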

2013-04-08 10:07:02

Thanks. This will work as long as there are no special characters such as ! – Tazo 2013-04-08 10:20:19
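One possible way to extend the padded contains check to text with punctuation is to replace special characters with spaces first. This is only a sketch: the class name, the normalize helper, and the rule "anything that is not a letter or digit counts as a separator" are assumptions, and the rule would mangle dictionary entries that contain punctuation themselves (for example "C++").

```java
class PaddedContains {

    // Replaces every run of non-letter, non-digit characters with a single space.
    static String normalize(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    // Pads both strings with spaces so the phrase can only match whole words.
    static boolean containsPhrase(String text, String phrase) {
        return (" " + normalize(text) + " ").contains(" " + normalize(phrase) + " ");
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("New York! is great", "New York")); // true
        System.out.println(containsPhrase("I love New York.", "New York"));   // true
        System.out.println(containsPhrase("New Authority", "New Author"));    // false
    }
}
```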

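I have used a lot of indexOf() and java.lang.String substring() calls, which may hurt the performance of the code, but the code below can serve as a first step towards this approach.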

public class MultiWordCompare {

    // Returns true if `search` occurs in `text` and is immediately followed by a space.
    private static boolean containsWord(String text, String search) {
        int start = text.indexOf(search); // only the first occurrence is examined
        if (start >= 0) {
            try {
                // +1 to capture a possible trailing space
                String w = text.substring(start, start + search.length() + 1);
                if (w.endsWith(" ")) { // a trailing space means the whole word was captured
                    return w.substring(0, w.length() - 1).equals(search);
                }
            } catch (StringIndexOutOfBoundsException e) {
                // the phrase sits at the very end of the text, so there is no trailing space
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsWord("New York is great!", "New York"));
        System.out.println(containsWord("Many many happy Returns for the day", "happy Returns"));
        System.out.println(containsWord("New Authority", "New Author"));
        System.out.println(containsWord("New York City is great!", "N Y C"));
    }

}

And here is the output:

true 
true 
false 
false

2013-04-08 10:10:40 sanbhat

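Sorry, I think I left this out of the question, but it is quite possible that my text has special characters, as in "New York! is great" or even "I love New York." – Tazo 2013-04-08 10:26:43

As I said, this is not a complete solution, just an approach. You can start with this initial code and add more handling. I have assumed that there is a space after every word; you can enhance the code to handle more cases. – sanbhat 2013-04-08 10:28:44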

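import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherExample {
    public static void main(String[] args) {
        String text =
            "This is the text to be searched " +
            "for occurrences of the http:// pattern.";

        String patternString = "This is the";

        Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);

        // lookingAt() only checks whether the text starts with the pattern
        System.out.println("lookingAt = " + matcher.lookingAt());
        // matches() requires the pattern to match the entire text
        System.out.println("matches = " + matcher.matches());
    }
}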

For more details, have a look at the link below.

Matcher
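For the question as asked, a phrase can occur anywhere in the text, so Matcher.find() is probably the more relevant call than lookingAt() or matches(). The sketch below also folds the whole dictionary into a single precompiled pattern so that the text is scanned once instead of once per phrase. The class name, the helper names, and the longest-first ordering are assumptions; whether this is actually faster for a 3000-entry dictionary would need to be measured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DictionaryMatcher {

    // Builds one pattern of the form \b(?:phrase1|phrase2|...)\b.
    // Longer phrases come first so that "New York" is preferred over "New"
    // when both are in the dictionary.
    static Pattern compile(List<String> phrases) {
        String alternation = phrases.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // Collects every dictionary phrase found in the text, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("author", "New York", "Happy birthday"));
        // "author" is not reported because it only occurs inside "authority".
        System.out.println(findAll(p, "Happy birthday to the New York authority!"));
        // prints [Happy birthday, New York]
    }
}
```

The pattern is compiled once and can be reused for every text that needs to be analyzed.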

2013-04-19 06:32:26 VKPRO
