2017-09-01 58 views
0

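Elasticsearch: wrong terms aggregation doc_count

I am trying to find duplicate users in an index with a terms aggregation whose script combines an array field and a plain field ([array] + field).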

My question is why the aggregation counts a document only once for a given key ([email protected]_SMITH). Is it possible to change this behavior?

Data:

POST users/user
{
    "name": "SMITH",
    "emails": [
        "[email protected]"
    ]
}

POST users/user
{
    "name": "SMITH",
    "emails": [
        "[email protected]",
        "[email protected]"
    ]
}

DISTINCT query:

POST users/_search
{
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {
                "script": {
                    "inline": "doc['emails.keyword'].value + '_' + doc['name.keyword'].value"
                }
            }
        }
    }
}

Result:

"aggregations": {
    "duplicateCount": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {
                "key": "[email protected]_SMITH",
                "doc_count": 1
            },
            {
                "key": "[email protected]_SMITH",
                "doc_count": 1
            }
        ]
    }
}
+1

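This is because doc['emails.keyword'].value only takes the first value from the 'emails' array. I'm not even sure you can use 'values', because a scripted terms aggregation cannot return two terms. – Val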

+0

@Val Thanks for the hint. It pointed me in the right direction. –

+0

Cool, but I'm not sure you did it correctly ;-) You should have only 'keys.add(p);' in the for loop and nothing else – Val

Answers

0

It seems you only get the correct terms aggregation counts with "terms" + "field".

If you try this query, you can see the difference between "terms" + "field" and "terms" + "script":

{
    "from": 0,
    "size": 0,
    "_source": true,
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "name": {
                            "query": "SMITH",
                            "operator": "OR",
                            "fuzziness": "AUTO",
                            "prefix_length": 1,
                            "max_expansions": 50,
                            "fuzzy_transpositions": true,
                            "lenient": false,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                }
            ]
        }
    },
    "aggs": {
        "duplicateCount": {
            "terms": {
                "script": {
                    "inline": "doc['emails.keyword'].value + '_' + doc['name.keyword'].value"
                }
            }
        },
        "duplicateCount2": {
            "terms": {
                "field": "emails.keyword"
            }
        }
    }
}

Here is the result. See duplicateCount2:

{
    "took": 53,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "duplicateCount2": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "[email protected]",
                    "doc_count": 2
                },
                {
                    "key": "[email protected]",
                    "doc_count": 1
                }
            ]
        },
        "duplicateCount": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "[email protected]_SMITH",
                    "doc_count": 1
                },
                {
                    "key": "[email protected]_SMITH",
                    "doc_count": 1
                }
            ]
        }
    }
}
0

OK. So I worked around it by iterating over the values of the array and building the required keys manually:

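// Emit one key per email so the document lands in every matching bucket.
def keys = [];
for (p in doc['emails.keyword'].values) {
    // The '_' separator matches the bucket keys in the result (email_NAME).
    keys.add(p + '_' + doc['name.keyword'].value);
}
return keys;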
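
For anyone wondering where that script goes: it replaces the "inline" string of the terms aggregation from the question. Below is a minimal sketch that assembles such a request body in Python and prints it as JSON. The aggregation name and the "inline" key are taken from the queries above; joining the script into a single line and including the '_' separator are my own choices for illustration, and the body would still have to be sent to users/_search with whatever client you use.

```python
import json

# The loop script from this answer, joined into a single line so it fits
# into a JSON string. The '_' separator is included so that the keys have
# the same email_NAME shape as the buckets shown in this thread.
script = (
    "def keys = []; "
    "for (p in doc['emails.keyword'].values) { "
    "keys.add(p + '_' + doc['name.keyword'].value); "
    "} "
    "return keys;"
)

# Same structure as the query in the question, with only the script swapped.
body = {
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {"script": {"inline": script}}
        }
    },
}

request_json = json.dumps(body, indent=2)
print(request_json)
```
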

Here is the result:

"buckets": [
    {
        "key": "[email protected]_SMITH",
        "doc_count": 2
    },
    {
        "key": "[email protected]_SMITH",
        "doc_count": 1
    }
]
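
To see why returning a list changes the counts, the bucketing can be mimicked outside Elasticsearch. This is only a rough Python sketch, not how Elasticsearch is implemented. It assumes a document is counted once in every distinct key the script returns, it treats .value as "the first value in the list" as described in the comment on the question, and it uses made-up addresses (a@example.com, b@example.com) because the real ones are not visible in this thread.

```python
from collections import Counter

# Hypothetical stand-ins for the two indexed documents.
docs = [
    {"name": "SMITH", "emails": ["b@example.com"]},
    {"name": "SMITH", "emails": ["a@example.com", "b@example.com"]},
]

def first_value_key(doc):
    # Mimics doc['emails.keyword'].value: a single key per document.
    return [doc["emails"][0] + "_" + doc["name"]]

def all_values_keys(doc):
    # Mimics looping over doc['emails.keyword'].values: one key per email.
    return [email + "_" + doc["name"] for email in doc["emails"]]

def bucket_counts(documents, keys_for):
    # Count each document once in every distinct key it produces.
    counts = Counter()
    for doc in documents:
        counts.update(set(keys_for(doc)))
    return dict(counts)

single = bucket_counts(docs, first_value_key)
multi = bucket_counts(docs, all_values_keys)
print(single)
print(multi)
```

With the single-key variant each document ends up in exactly one bucket, so both buckets have a count of 1. With the list variant the second document also lands in the bucket of the shared address, which gives the 2 and 1 seen above.
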
+0

Cool, 'keys.add(p)' should be enough, though :-) – Val