Logstash - How to prevent loading duplicate records

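We have a simple index called employees with only two fields, firstname and lastname. We load our employee data with a Logstash script. Even if the data file contains duplicates, we do not want duplicate records stored in the index. In this case, if firstname + lastname are the same, the record should not be added to the index.

The Logstash script is: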

input {
  file {
    path => "C:/employees.csv"
  }
}
filter {
  csv {
    columns => [
      "firstname",
      "lastname"
    ]
    separator => ","
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "employees"
  }
}

data file - employees.csv 

john,doe 
jane,doe 
john,doe - this record should not be added to the index. 

I went through a lot of documentation and searched extensively for a way to add conditions in the filter clause, but have had no luck so far.

Can anyone provide input on this?

Thanks

Source

2017-05-06 Srinivas KK

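It sounds like you are looking for the Elasticsearch mapping _id field. If you set that field based on a hash of each row's lastname/firstname (or similar), you should avoid inserting duplicate data.

Elasticsearch's default behavior is to autogenerate unique ids if you do not specify what you want the _id to be.

Edit: if lastname + firstname is unique enough in your data set: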

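...
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "employees"
    # document_id is the elasticsearch output option that sets the document's _id,
    # so a repeated name pair maps to the same document instead of a new one
    document_id => "%{lastname}%{firstname}"
  }
}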
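
For the hash-based variant mentioned above, the following is a minimal sketch rather than a tested configuration. It assumes the fingerprint filter plugin is installed and stores the hash under `[@metadata][fingerprint]`, a field name chosen here for illustration. Older versions of the plugin may also require a `key` setting for the SHA methods.

```
filter {
  csv {
    columns => ["firstname", "lastname"]
    separator => ","
  }
  # Hash both name fields together into one value.
  fingerprint {
    source => ["firstname", "lastname"]
    concatenate_sources => true
    method => "SHA256"
    # @metadata fields are not sent to Elasticsearch as part of the document
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "employees"
    # Rows with the same firstname + lastname get the same document id
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

With the sample file, both `john,doe` rows would produce the same id, so the index ends up with one document for that name instead of two.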

Source

2017-05-06 07:25:22 Brett

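Can you give me the syntax for creating a hash-based id from lastname + firstname? Thanks.

Thanks, it works perfectly.

Accepted it as the answer. I have two more questions. 1) If I receive the same record with other fields changed, the record is not updated. Basically, I am looking for an upsert. For example, if the salary changes for an existing name, the existing record needs to be updated. Any input on this? 2) Will using a hashed ID (firstname + lastname) instead of Elasticsearch's auto-generated ID add any performance overhead? Many thanks.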
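
On the upsert question, one possible approach (not confirmed anywhere in this thread) is the elasticsearch output's `action => "update"` together with `doc_as_upsert => true`. The sketch below assumes the same employees index and a hypothetical third CSV column named `salary`, which would also have to be added to the csv filter's `columns` list.

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "employees"
    document_id => "%{lastname}%{firstname}"
    # update the existing document for this id, merging in the event's fields
    action => "update"
    # if no document with this id exists yet, create it from the event
    doc_as_upsert => true
  }
}
```

With the default `index` action, an event whose id already exists replaces the whole stored document. The `update` action merges the event's fields into it instead, which matters only if the stored document has fields the incoming event lacks.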
