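Why doesn't my Lucene indexer find Persian text?

I'm using Lucene 3 to index some .txt files, as shown below. Why does a search of the index find nothing for Persian text?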

public static void main(String[] args) throws Exception { 

    String indexDir = "file input"; 
    String dataDir = "file input"; 
    long start = System.currentTimeMillis(); 

    indexer indexer = new indexer(indexDir); 
    int numIndexed, cnt; 
    try { 
     numIndexed = indexer.index(dataDir, new TextFilesFilter()); 

     cnt = indexer.getHitCount("mycontents", "شهردار"); 
     System.out.println("count of search in contents: " + cnt); 
    } finally { 
     indexer.close(); 
    } 
    long end = System.currentTimeMillis(); 
    System.out.println("Indexing " + numIndexed + " files took " 
      + (end - start) + " milliseconds"); 

}

The getHitCount function returns a hit count for English words, but it returns zero for Persian words!

public int getHitCount(String fieldName, String searchString) 
     throws IOException, ParseException { 

    IndexSearcher searcher = new IndexSearcher(directory); 

    Term t = new Term(fieldName, searchString); 
    Query query = new TermQuery(t); 

    int hitCount = searcher.search(query, 1).totalHits; 
    searcher.close(); 
    return hitCount; 
}

How do I set UTF-8 in my project? I'm using NetBeans and created a simple Java project. I just need a simple search over files!

This is my indexer class:

private IndexWriter writer; 
private Directory directory; 

public indexer(String indexDir) throws IOException { 
    directory = FSDirectory.open(new File(indexDir)); 
    writer = new IndexWriter(directory, 
      new StandardAnalyzer(
        Version.LUCENE_30), 
      true, 
      IndexWriter.MaxFieldLength.UNLIMITED); 
} 

public void close() throws IOException { 
    writer.close(); 
} 

public int index(String dataDir, FileFilter filter) 
     throws Exception { 
    File[] files = new File(dataDir).listFiles(); 
    for (File f : files) { 
     if (!f.isDirectory() 
       && !f.isHidden() 
       && f.exists() 
       && f.canRead() 
       && (filter == null || filter.accept(f))) { 
      indexFile(f); 
     } 
    } 
    return writer.numDocs(); 
} 

private static class TextFilesFilter implements FileFilter { 

    public boolean accept(File path) { 
     return path.getName().toLowerCase() 
       .endsWith(".txt"); 
    } 
} 

protected Document getDocument(File f) throws Exception { 
    Document doc = new Document(); 
    doc.add(new Field("mycontents", new FileReader(f))); 
    doc.add(new Field("filename", f.getName(), 
      Field.Store.YES, Field.Index.NOT_ANALYZED)); 
    doc.add(new Field("fullpath", f.getCanonicalPath(), 
      Field.Store.YES, Field.Index.NOT_ANALYZED)); 
    return doc; 
} 

private void indexFile(File f) throws Exception { 
    System.out.println("Indexing " + f.getCanonicalPath()); 
    Document doc = getDocument(f); 
    writer.addDocument(doc); 
}

Source

2016-02-05 NASRIN

Can we see your indexer class? It looks like something you implemented yourself. – Niklas

@Niklas I've edited my question. – NASRIN

This may help you: http://stackoverflow.com/questions/23030329/lucene-encoding-java – Niklas

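I suspect the problem is not Lucene's handling of encodings but FileReader. From the FileReader documentation:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate.

The platform's default character encoding is probably not the right one in this case, so the Persian text is decoded incorrectly before Lucene ever sees it.

Instead of:

doc.add(new Field("mycontents", new FileReader(f)));

try this (assuming the files being indexed are UTF-8 encoded):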

doc.add(new Field("mycontents", new InputStreamReader(new FileInputStream(f), "UTF-8")));
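To check whether the decoding step is really the culprit, it can help to test it without Lucene. The following is a minimal sketch, not part of the original answer: the class name Utf8ReadDemo and its helper methods are made up for illustration, and it assumes Java 7 or later (for StandardCharsets and try-with-resources). It writes the Persian search term from the question to a temporary file as UTF-8 and reads it back with different charsets, comparing the result to the original string.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not from the question): shows that the charset
// used to decode a file decides whether non-ASCII text survives.
class Utf8ReadDemo {

    // Reads the whole file, decoding its bytes with the given charset.
    static String readAll(File f, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new FileInputStream(f), cs)) {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Writes text to a temp file as UTF-8, reads it back with decodeWith,
    // and reports whether the text came back unchanged.
    static boolean roundTrips(String text, Charset decodeWith) {
        try {
            File f = File.createTempFile("persian", ".txt");
            try {
                try (Writer out = new OutputStreamWriter(
                        new FileOutputStream(f), StandardCharsets.UTF_8)) {
                    out.write(text);
                }
                return text.equals(readAll(f, decodeWith));
            } finally {
                f.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The Persian search term from the question, written with Unicode
        // escapes so the source file's own encoding does not matter.
        String word = "\u0634\u0647\u0631\u062F\u0627\u0631";
        // FileReader(File) decodes with this charset.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 decode matches: "
                + roundTrips(word, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 decode matches: "
                + roundTrips(word, StandardCharsets.ISO_8859_1));
    }
}
```

If the default charset printed on your machine is not UTF-8, then new FileReader(f) is decoding your UTF-8 files with the wrong charset, and the terms stored in the index will not match the Persian search string. Note that the search string in the source code is affected in the same way, so the project's source encoding in NetBeans should also be UTF-8.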

Source

2016-02-05 17:03:22 femtoRgon
