2011-05-03 139 views
3

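How do I get the matched fuzzy terms and their offsets when using Lucene fuzzy search? How do I get the matching terms for a Lucene fuzzy search result?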

IndexSearcher mem = ....(some standard code)

QueryParser parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);

// the ~ triggers the fuzzy search as per "Lucene In Action"
TopDocs topDocs = mem.search(parser.parse("wuzzy~"), 1);

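The fuzzy search works fine. If a document contains the term "fuzzy" or "luzzy", it is a match. How do I get the terms that matched, along with their offsets?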

I have made sure that every CONTENT_FIELD is added with its term vector stored, including positions and offsets.

+1

Are you looking for something along these lines? http://lucene.apache.org/java/3_0_0/api/contrib-highlighter/index.html – Jared 2011-05-03 19:05:57

+0

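No, I don't want highlighted text; I need to do further text processing. Before I can do that, I need to find out whether the term that matched was "fuzzy" or "luzzy" and so on, because this is a fuzzy match. – user193116 2011-05-03 19:22:39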

Answer

6

There is no direct way to do this, but I reconsidered Jared's suggestion and was able to get a solution working.

I am documenting it here in case someone else has the same problem.

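Create a class that implements org.apache.lucene.search.highlight.Formatter. The highlighter calls it for every token group, so it can record the matches instead of marking them up: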

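import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.TokenGroup;

public class HitPositionCollector implements Formatter 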
{ 
    // MatchOffset is a simple DTO 
    private List<MatchOffset> matchList; 
    public HitPositionCollector()
    { 
     matchList = new ArrayList<MatchOffset>(); 
    } 

    // this is where the term's start and end offsets, as well as the actual term, are captured 
    @Override 
    public String highlightTerm(String originalText, TokenGroup tokenGroup) 
    { 
     // only token groups with a positive score contain a query match
     if (tokenGroup.getTotalScore() > 0) 
     { 
      MatchOffset mo = new MatchOffset(tokenGroup.getToken(0).toString(), tokenGroup.getStartOffset(), tokenGroup.getEndOffset()); 
      getMatchList().add(mo); 
     } 

     return originalText; 
    } 

    /** 
    * @return the matchList 
    */ 
    public List<MatchOffset> getMatchList() 
    { 
     return matchList; 
    } 
} 
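
The MatchOffset DTO is not shown above. A minimal sketch of what it could look like follows; the constructor argument order and the getter names are inferred from how the collector and the test use it, and everything else (field names, immutability, toString) is an assumption made for illustration.

```java
// Hypothetical sketch of the MatchOffset DTO; the original class is not shown.
// It only carries the matched token text and its character offsets.
final class MatchOffset {
    private final String token;   // text of the matched token
    private final int startPos;   // start offset in the source text (inclusive)
    private final int endPos;     // end offset in the source text (exclusive)

    MatchOffset(String token, int startPos, int endPos) {
        this.token = token;
        this.startPos = startPos;
        this.endPos = endPos;
    }

    public String getToken() {
        return token;
    }

    public int getStartPos() {
        return startPos;
    }

    public int getEndPos() {
        return endPos;
    }

    @Override
    public String toString() {
        // Half-open interval notation, e.g. brown[10,15)
        return token + "[" + startPos + "," + endPos + ")";
    }
}
```

Keeping the object immutable is a design choice of this sketch: the collector only ever creates instances and appends them to its list, so setters are not needed.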

Main code

public void testHitsWithHitPositionCollector() throws Exception 
{ 
    System.out.println(" .... testHitsWithHitPositionCollector"); 
    // "bro*" is parsed as a prefix query 
    String fuzzyStr = "bro*"; 

    QueryParser parser = new QueryParser(Version.LUCENE_30, "f", analyzer); 
    Query fzyQry = parser.parse(fuzzyStr); 
    TopDocs hits = searcher.search(fzyQry, 10); 

    QueryScorer scorer = new QueryScorer(fzyQry, "f"); 

    // The formatter collects hit positions instead of adding markup 
    HitPositionCollector myFormatter = new HitPositionCollector(); 

    // Highlighter(Formatter formatter, Scorer fragmentScorer) 
    Highlighter highlighter = new Highlighter(myFormatter, scorer); 
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer)); 

    Analyzer analyzer2 = new SimpleAnalyzer(); 

    // Only the first hit is examined here 
    int docId = hits.scoreDocs[0].doc; 
    Document doc = searcher.doc(docId); 
    String title = doc.get("f"); 

    TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), 
            docId, 
            "f", 
            doc, 
            analyzer2); 

    // Running the highlighter is what drives the calls to highlightTerm() 
    String fragment = highlighter.getBestFragment(stream, title); 

    System.out.println(fragment); 
    // The formatter returns the text unchanged, so the fragment carries no markup 
    assertEquals("the quick brown fox jumps over the lazy dog", fragment); 
    MatchOffset mo = myFormatter.getMatchList().get(0); 

    assertEquals(15, mo.getEndPos()); 
    assertEquals(10, mo.getStartPos()); 
    assertEquals("brown", mo.getToken()); 
} 
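
Once the offsets are collected, the further text processing can work directly on the stored field value. The sketch below is plain Java with no Lucene dependency and only illustrates the idea; the class and method names (OffsetSlicer, matchedText, withContext) are hypothetical, and it assumes the offsets are character positions into the same string that was indexed, with an exclusive end offset as in the test above.

```java
// Hypothetical helper showing how collected offsets can be applied
// to the original field text for further processing.
final class OffsetSlicer {

    // Returns the exact source text covered by [start, end).
    static String matchedText(String source, int start, int end) {
        return source.substring(start, end);
    }

    // Returns the match plus up to `window` characters on each side,
    // clamped to the bounds of the source text.
    static String withContext(String source, int start, int end, int window) {
        int from = Math.max(0, start - window);
        int to = Math.min(source.length(), end + window);
        return source.substring(from, to);
    }

    public static void main(String[] args) {
        String source = "the quick brown fox jumps over the lazy dog";
        // Offsets 10 and 15 are the ones asserted in the test above.
        System.out.println(matchedText(source, 10, 15));
        System.out.println(withContext(source, 10, 15, 6));
    }
}
```

Reading the matched word back from the source text this way gives the surface form as it appears in the document, which may differ from the analyzed token if the analyzer lowercases or otherwise normalizes terms.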
+0

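Although this feels a bit _hackish_ (not your fault, I just think there should be a cleaner way), it is the only working implementation I have found. Thanks! – 2012-03-08 12:03:46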