2016-10-04 145 views
0

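How to create an inverted index in Java

I want to build an inverted index in Java. I have the cran data, which consists of 1400 text files. I can compute the frequency of each term/word, and I can return the number of times a word appears in the whole collection, but I have not been able to return which documents the word appears in.

I would like the output in the following form:

TERM1:DOC1:2,DOC2:3
TERM2:DOC1:3,DOC4:1
... and so on

Here a term is a word in a document file, and DOC1:2 means that TERM1 appears 2 times in document 1.

This is the code so far: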

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// The original snippet omitted the class declaration; the name WordCount is assumed.
public class WordCount {

    public static void main(String[] args) throws FileNotFoundException {
        // Word -> number of occurrences in the file currently being read.
        Map<String, Integer> m = new HashMap<>();

        for (int i = 1; i <= 2; i++) {
            Scanner tdsc = new Scanner(new File("D:\\logs\\steem" + i + ".txt"));
            while (tdsc.hasNext()) {
                String wrd = tdsc.next();
                Integer freq = m.get(wrd);
                m.put(wrd, (freq == null) ? 1 : freq + 1);
            }

            System.out.println(m.size() + " distinct words" + " steem" + i);
            System.out.println("Doc" + i + "" + m);
            // Reset the counts before reading the next file.
            m.clear();
            tdsc.close();
        }
    }
}
+0

http://stackoverflow.com/questions/12511543/how-to-build-a-simple-inverted-index

Answer

0
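import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

// The original snippet omitted the class declaration; the name InvertedIndex is assumed.
public class InvertedIndex {

    public static void main(String[] args) throws FileNotFoundException {
        // Maps each word to the set of documents that contain it.
        Map<String, Set<Doc>> wordDocMap = new HashMap<>();

        for (int i = 1; i <= 2; i++) {
            Doc document = new Doc("doc" + i);
            // try-with-resources closes the Scanner even if reading fails.
            try (Scanner tdsc = new Scanner(new File("D:\\logs\\steem" + i + ".txt"))) {
                while (tdsc.hasNext()) {
                    String word = tdsc.next();
                    document.put(word);
                    wordDocMap.computeIfAbsent(word, w -> new HashSet<>()).add(document);
                }
            }
        }

        // Build one line per word, e.g. word:doc1:2, doc2:3
        StringBuilder builder = new StringBuilder();
        for (Map.Entry<String, Set<Doc>> entry : wordDocMap.entrySet()) {
            String word = entry.getKey();
            builder.append(word).append(":");
            for (Doc document : entry.getValue()) {
                builder.append(document.getDocName()).append(":").append(document.getCount(word));
                builder.append(", ");
            }
            // Drop the trailing ", " (both characters, not only the comma).
            builder.setLength(builder.length() - 2);
            builder.append("\n");
        }
        System.out.println(builder);
    }

    // Holds the per-word counts for a single document.
    static class Doc {
        String docName;
        Map<String, Integer> m = new HashMap<>();

        public Doc(String docName) {
            this.docName = docName;
        }

        // Records one more occurrence of the word in this document.
        public void put(String word) {
            Integer freq = m.get(word);
            m.put(word, (freq == null) ? 1 : freq + 1);
        }

        public Integer getCount(String word) {
            return m.get(word);
        }

        public String getDocName() {
            return this.docName;
        }
    }
}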
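
As a possible alternative to a separate Doc class, the same information can be kept in a nested map: term -> (document name -> count). The sketch below is only an illustration under some assumptions: the class name InvertedIndexSketch and the methods buildIndex and format are hypothetical, the document texts are passed in as strings rather than read from D:\logs, and tokens are split on whitespace, which is similar to what Scanner.next() does by default. TreeMap is used so that terms and documents come out in sorted order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical sketch; names are illustrative and not taken from the answer above.
class InvertedIndexSketch {

    // Builds term -> (document name -> occurrences in that document).
    // docs maps a document name to its full text (assumed to be loaded already).
    static Map<String, Map<String, Integer>> buildIndex(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Split on runs of whitespace.
            for (String term : doc.getValue().trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Renders one line per term in the form TERM:DOC:count,DOC:count
    static String format(Map<String, Map<String, Integer>> index) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Map<String, Integer>> entry : index.entrySet()) {
            StringJoiner postings = new StringJoiner(",");
            entry.getValue().forEach((doc, count) -> postings.add(doc + ":" + count));
            out.append(entry.getKey()).append(':').append(postings).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "flow of air flow");
        docs.put("doc2", "air pressure");
        System.out.print(format(buildIndex(docs)));
        // Prints:
        // air:doc1:1,doc2:1
        // flow:doc1:2
        // of:doc1:1
        // pressure:doc2:1
    }
}
```

With this layout, index.get(term).keySet() gives the documents that contain a term, and the inner map gives the count per document, so no equals/hashCode is needed on a document class.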
+0

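If I am not mistaken, I need to print the map "wordDocMap". But when I add the line System.out.println(wordDocMap); it only shows {}. I am new to maps, so it would be very helpful if you could tell me what is going on.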

+0

How do I print the output?

+0

You can print the StringBuilder named builder.