Lucene: payloads and similarity function --- always the same payload value
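I want to use the new payload feature, which allows meta-information to be attached to text in a Lucene index/searcher. In my specific case, I add weights (which can be read as % probabilities, between 0 and 100) to concept tags, in order to use them to override the standard Lucene TF-IDF weighting. I am puzzled by the behavior I see. I believe something is wrong with the Similarity class that I overrode, but I cannot figure out what.
Example
When I run a search query (e.g. "concept:red"), every payload passed through MyPayloadSimilarity is always the first number (in the code example, this is 1.0) instead of 1.0, 50.0 and 100.0. As a result, all documents get the same payload and the same score. However, the data should rank picture #1 first, with a payload of 100.0, followed by picture #2 and then picture #3, with very different scores. I cannot get my head around it.
Here is the output of a run: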
Query: concept:red
===> docid: 0 payload: 1.0
===> docid: 1 payload: 1.0
===> docid: 2 payload: 1.0
Number of results:3
-> docid: 3.jpg score: 0.2518424
-> docid: 2.jpg score: 0.2518424
-> docid: 1.jpg score: 0.2518424
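What is wrong? Am I misunderstanding something about payloads?
Code
Below is my code as a self-contained example, to make it as easy as possible for you to run it, should you consider that option.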
public class PayloadShowcase {

    public static void main(String s[]) {
        PayloadShowcase p = new PayloadShowcase();
        p.run();
    }

    public void run() {
        // Step 1: indexing
        MyPayloadIndexer indexer = new MyPayloadIndexer();
        indexer.index();
        // Step 2: searching
        MyPayloadSearcher searcher = new MyPayloadSearcher();
        searcher.search("red");
    }

    public class MyPayloadAnalyzer extends Analyzer {
        private PayloadEncoder encoder;

        MyPayloadAnalyzer(PayloadEncoder encoder) {
            this.encoder = encoder;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(reader);
            TokenStream filter = new LowerCaseFilter(source);
            filter = new DelimitedPayloadTokenFilter(filter, '|', encoder);
            return new TokenStreamComponents(source, filter);
        }
    }

    public class MyPayloadIndexer {

        public MyPayloadIndexer() {}

        public void index() {
            try {
                Directory dir = FSDirectory.open(new File("D:/data/indices/sandbox"));
                Analyzer analyzer = new MyPayloadAnalyzer(new FloatEncoder());
                IndexWriterConfig iwconfig = new IndexWriterConfig(Version.LUCENE_4_10_1, analyzer);
                iwconfig.setSimilarity(new MyPayloadSimilarity());
                iwconfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
                // load mappings and classifiers
                HashMap<String, String> mappings = this.loadDataMappings();
                HashMap<String, HashMap> cMaps = this.loadData();
                IndexWriter writer = new IndexWriter(dir, iwconfig);
                indexDocuments(writer, mappings, cMaps);
                writer.close();
            } catch (IOException e) {
                System.out.println("Exception while indexing: " + e.getMessage());
            }
        }

        private void indexDocuments(IndexWriter writer, HashMap<String, String> fileMappings, HashMap<String, HashMap> concepts) throws IOException {
            Set fileSet = fileMappings.keySet();
            Iterator<String> iterator = fileSet.iterator();
            while (iterator.hasNext()) {
                // unique file information
                String fileID = iterator.next();
                String filePath = fileMappings.get(fileID);
                // create a new, empty document
                Document doc = new Document();
                // path of the indexed file
                Field pathField = new StringField("path", filePath, Field.Store.YES);
                doc.add(pathField);
                // lookup all concept probabilities for this fileID
                Iterator<String> conceptIterator = concepts.keySet().iterator();
                while (conceptIterator.hasNext()) {
                    String conceptName = conceptIterator.next();
                    HashMap conceptMap = concepts.get(conceptName);
                    doc.add(new TextField("concept", ("" + conceptName + "|").trim() + (conceptMap.get(fileID) + "").trim(), Field.Store.YES));
                }
                writer.addDocument(doc);
            }
        }

        public HashMap<String, String> loadDataMappings() {
            HashMap<String, String> h = new HashMap<>();
            h.put("1", "1.jpg");
            h.put("2", "2.jpg");
            h.put("3", "3.jpg");
            return h;
        }

        public HashMap<String, HashMap> loadData() {
            HashMap<String, HashMap> h = new HashMap<>();
            HashMap<String, String> green = new HashMap<>();
            green.put("1", "50.0");
            green.put("2", "1.0");
            green.put("3", "100.0");
            HashMap<String, String> red = new HashMap<>();
            red.put("1", "100.0");
            red.put("2", "50.0");
            red.put("3", "1.0");
            HashMap<String, String> blue = new HashMap<>();
            blue.put("1", "1.0");
            blue.put("2", "50.0");
            blue.put("3", "100.0");
            h.put("green", green);
            h.put("red", red);
            h.put("blue", blue);
            return h;
        }
    }

    class MyPayloadSimilarity extends DefaultSimilarity {
        @Override
        public float scorePayload(int docID, int start, int end, BytesRef payload) {
            float pload = 1.0f;
            if (payload != null) {
                pload = PayloadHelper.decodeFloat(payload.bytes);
            }
            System.out.println("===> docid: " + docID + " payload: " + pload);
            return pload;
        }
    }

    public class MyPayloadSearcher {

        public MyPayloadSearcher() {}

        public void search(String queryString) {
            try {
                IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("D:/data/indices/sandbox")));
                IndexSearcher searcher = new IndexSearcher(reader);
                searcher.setSimilarity(new PayloadSimilarity());
                PayloadTermQuery query = new PayloadTermQuery(new Term("concept", queryString),
                        new AveragePayloadFunction());
                System.out.println("Query: " + query.toString());
                TopDocs topDocs = searcher.search(query, 999);
                ScoreDoc[] hits = topDocs.scoreDocs;
                System.out.println("Number of results:" + hits.length);
                // output
                for (int i = 0; i < hits.length; i++) {
                    Document doc = searcher.doc(hits[i].doc);
                    System.out.println("-> docid: " + doc.get("path") + " score: " + hits[i].score);
                }
                reader.close();
            } catch (Exception e) {
                System.out.println("Exception while searching: " + e.getMessage());
            }
        }
    }
}
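No, still the same result. :( – RalfB 2014-10-30 08:18:13
@ralfb, in your example code, 'MyPayloadSearcher.search' sets 'PayloadSimilarity' (which does not exist here, but may exist in your code) rather than 'MyPayloadSimilarity'. That may be why you are not seeing any change. Please check which "PayloadSimilarity" class you are actually using. – 2014-10-30 18:24:58
Oh my! That was it! I know it was stupid, and I really appreciate you kicking my brain into gear. In future I will also create separate projects and make sure sandbox code is isolated, rather than keeping it in the same project where it invites copy-and-paste errors. Thanks, Juliano! :) – RalfB 2014-10-30 18:56:38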
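The mix-up identified in the comments can be illustrated without Lucene. The following is only a minimal sketch under stated assumptions: 'BaseSimilarity', 'DecodingSimilarity', 'encode' and 'scoreAll' are hypothetical stand-ins invented for this illustration and are not Lucene classes or methods. The sketch assumes a base class whose payload hook ignores the payload and returns 1.0, and a subclass that decodes a big-endian float. Handing the scoring loop the wrong class then yields the same value for every document, which resembles the symptom described in the question.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SimilarityMixup {

    /** Hypothetical stand-in for a similarity whose payload hook ignores the payload. */
    static class BaseSimilarity {
        float scorePayload(byte[] bytes, int offset) {
            return 1.0f; // assumption: the default contributes a neutral factor
        }
    }

    /** Hypothetical stand-in for a custom similarity that decodes the stored weight. */
    static class DecodingSimilarity extends BaseSimilarity {
        @Override
        float scorePayload(byte[] bytes, int offset) {
            // Read a 4-byte big-endian float starting at the given offset.
            return ByteBuffer.wrap(bytes, offset, 4).getFloat();
        }
    }

    /** Encodes a weight as a 4-byte big-endian float (illustrative helper). */
    static byte[] encode(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    /** Scores one payload per document with whichever similarity is passed in. */
    static float[] scoreAll(BaseSimilarity similarity, float[] weights) {
        float[] scores = new float[weights.length];
        for (int i = 0; i < weights.length; i++) {
            scores[i] = similarity.scorePayload(encode(weights[i]), 0);
        }
        return scores;
    }

    public static void main(String[] args) {
        float[] red = {100.0f, 50.0f, 1.0f}; // weights for "red" from the question's data

        // Wrong class wired into the search side: every payload comes out as 1.0.
        System.out.println(Arrays.toString(scoreAll(new BaseSimilarity(), red)));

        // Intended class: the stored weights are recovered.
        System.out.println(Arrays.toString(scoreAll(new DecodingSimilarity(), red)));
    }
}
```

In the question's code, the analogous check would be to confirm that the class passed to 'searcher.setSimilarity(...)' is the same one passed to 'iwconfig.setSimilarity(...)'.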