2012-07-12 47 views
2

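Extracting n-grams from patterns

I am trying to extract n-grams from patterns that were themselves extracted from a text file. The patterns contain different numbers of terms.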

For example, if the pattern is p = {t1, t2, t3}

and we need to extract n-grams up to size 3,

the output should look like this:

t1
t2
t3

t1,t2
t2,t3

t1,t2,t3

I wrote some code, but it does not work properly.

 public Hashtable<String, Pattern> findGrams(XMLDocument d) { 
    ArrayList<Pattern> patterns = d.getPatterns(); 

    System.out.println("patterns " + patterns); 

    ArrayList<String> terms = new ArrayList<String>(); 
    Hashtable<String, Pattern> grams = new Hashtable<String, Pattern>(); 

    for (int i = 0; i < patterns.size(); i++) { 
     Pattern pat = patterns.get(i); 

     // collect all terms of the current pattern 
     terms.clear(); 
     for (int z = 0; z < pat.getNumitems(); z++) { 
      terms.add(pat.getItems().get(z).toString()); 
     } 

     // generate every contiguous gram of size 1..ngram 
     // (ngram is assumed to be a field holding the maximum gram size) 
     for (int n = 1; n <= ngram; n++) { 
      // a window of n terms fits at start positions 0..terms.size()-n 
      for (int start = 0; start + n <= terms.size(); start++) { 
       StringBuilder sb = new StringBuilder(terms.get(start)); 
       for (int k = start + 1; k < start + n; k++) { 
        sb.append(",").append(terms.get(k)); 
       } 
       String s = sb.toString(); 

       // containsKey checks the keys; Hashtable.contains would check the values 
       if (!grams.containsKey(s)) { 
        System.out.println(s); 
        // map each gram to the first pattern that produced it 
        grams.put(s, pat); 
       } 
      } 
     } 
    } 
    return grams; 
} 

Any help would be appreciated.

+1

I find it hard to figure out what you are asking here. Can you provide a concrete example with a given input and the expected output? – 2012-07-12 12:56:42

+0

For example: if the pattern is p = {t1, t2, t3} and the n-gram size to extract is 3, the output should be: for n = 1: t1, then t2, then t3; for n = 2: t1,t2 then t2,t3; for n = 3: t1,t2,t3 – Mubarak 2012-07-12 13:02:39

+0

Is this homework? It might be a duplicate of http://stackoverflow.com/questions/3656762/n-gram-generation-from-a-sentence – radimpe 2012-07-12 13:03:05

Answers

0

I hope this gives you what you need.

import java.util.*; 

public class Test { 

    // Returns all n-grams made of exactly n consecutive words of str. 
    public static List<String> ngrams(int n, String str) { 
     List<String> ngrams = new ArrayList<String>(); 
     String[] words = str.split(" "); 
     // slide a window of n words over the input 
     for (int i = 0; i < words.length - n + 1; i++) 
      ngrams.add(concat(words, i, i+n)); 
     return ngrams; 
    } 

    // Joins words[start] .. words[end - 1] with single spaces. 
    public static String concat(String[] words, int start, int end) { 
     StringBuilder sb = new StringBuilder(); 
     for (int i = start; i < end; i++) 
      sb.append((i > start ? " " : "") + words[i]); 
     return sb.toString(); 
    } 

    public static void main(String[] args) { 
     // print the 1-grams, 2-grams and 3-grams of "t1 t2 t3", 
     // with a blank line after each group 
     for (int n = 1; n <= 3; n++) { 
      for (String ngram : ngrams(n, "t1 t2 t3")) 
       System.out.println(ngram); 
      System.out.println(); 
     } 
    } 
}
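
The code above works on a space-separated string, while the question starts from a list of terms and joins them with commas. The same sliding-window idea could be applied to such a list directly. The following is only a minimal sketch under that assumption: the class name `PatternNgrams` and the method `grams` are made up for illustration and are not part of the asker's `Pattern` or `XMLDocument` classes, and `String.join` requires Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, named for illustration only.
class PatternNgrams {

    // Returns every contiguous gram of size 1..maxN, shortest grams first,
    // with the terms of each gram joined by commas.
    static List<String> grams(List<String> terms, int maxN) {
        List<String> result = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            // A window of n terms fits at start positions 0 .. terms.size() - n.
            for (int start = 0; start + n <= terms.size(); start++) {
                result.add(String.join(",", terms.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints t1, t2, t3, t1,t2, t2,t3, t1,t2,t3 (one gram per line).
        for (String gram : grams(Arrays.asList("t1", "t2", "t3"), 3)) {
            System.out.println(gram);
        }
    }
}
```

If maxN is larger than the number of terms, the inner loop simply produces nothing for the oversized windows, so patterns with different numbers of terms can share the same maximum size.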