Elasticsearch: poor performance on large data sets

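I have a 7-node Elasticsearch cluster with 2 indices, both of which use nested object mappings. I am facing slow inserts into Index2 (via Spark Streaming). I am using bulk inserts, and each batch (~100k records) takes ~8-12 s.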

Node Configuration: 
RAM: 64 GB 
Core: 48 
HDD : 1 TB 
JVM allocated Memory: 32 GB 

Marvel Node Status: 
CPU Usage: ~10-20% 
JVM Memory: ~60-75% 
Load Average : ~3-35 
Indexing Rate: ~10k/s 
Search Rate: ~2k/s 

Index1 (Replication 1): 
Status: green 
Documents: 84.4b 
Data: 9.3TB 
Total Shards: 400 (Could this be the reason for the low performance?) 

Index2 (Replication 1): 
Status: green 
Documents: 1.4b 
Data: 35.8GB 
Total Shards: 10 
Unassigned Shards: 0 

Spark streaming configuration: 
executors: 2 
Executor core per executor: 8 
Number of partition: 16 
batch size: 10s 
Event per batch: ~1k-200k 
Elastic Bulk insert count: 100k
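
As a quick sanity check on the figures above, the bulk timing can be converted into a documents-per-second rate and compared with the indexing rate Marvel reports. This is only back-of-the-envelope arithmetic on the numbers quoted in the question; it assumes the two "Total Shards" values already include replicas.

```python
# Figures quoted in the question; nothing here is measured independently.
records_per_bulk = 100_000
bulk_seconds = (8, 12)        # observed time per bulk request, in seconds
total_shards = 400 + 10       # Index1 + Index2 (assumed to include replicas)
nodes = 7


def docs_per_second(records, seconds):
    """Throughput of a single bulk request."""
    return records / seconds


fastest = docs_per_second(records_per_bulk, bulk_seconds[0])
slowest = docs_per_second(records_per_bulk, bulk_seconds[1])
shards_per_node = total_shards / nodes

print(f"bulk throughput: {slowest:.0f}-{fastest:.0f} docs/s")
print(f"shards per node: {shards_per_node:.1f}")
```

The resulting range of roughly 8k-12k docs/s is in line with the ~10k/s indexing rate shown by Marvel, and each node ends up hosting close to 60 shards.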

Index2 mapping:

{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1
        }
    },
    "mappings": {
        "parent_list": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "parents": {
                    "type": "nested",
                    "properties": {
                        "parent_id": {
                            "type": "integer",
                            "doc_values": false
                        },
                        "childs": {
                            "type": "nested",
                            "properties": {
                                "child_id": {
                                    "type": "integer",
                                    "doc_values": false
                                },
                                "timestamp": {
                                    "type": "long",
                                    "doc_values": false
                                },
                                "is_deleted": {
                                    "type": "boolean",
                                    "doc_values": false
                                }
                            }
                        }
                    }
                },
                "other_ID": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": false
                }
            }
        }
    }
}

My queries:

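Fetch the count by parent ID, with at least one child whose is_deleted is false.
Fetch the count by child ID, where is_deleted is false.
Fetch a document by _id.

I was expecting higher performance from Elasticsearch, but it has become the bottleneck of my system. Can someone suggest some performance tuning? Can we achieve a higher insert rate from Elasticsearch with this cluster configuration?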
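
For illustration only, the first two queries could be written against the mapping above roughly as follows. This is a sketch, not the exact queries used in the question: it assumes Elasticsearch 2.x-style nested queries (a guess based on the `string` / `not_analyzed` mapping), and the function names are made up for the example. The bodies are meant to be sent to a count endpoint.

```python
import json


def count_by_parent_with_live_child(parent_id):
    """Count body: documents where the given parent has at least one
    child with is_deleted == false."""
    return {
        "query": {
            "nested": {
                "path": "parents",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"parents.parent_id": parent_id}},
                            {
                                "nested": {
                                    "path": "parents.childs",
                                    "query": {
                                        "term": {"parents.childs.is_deleted": False}
                                    },
                                }
                            },
                        ]
                    }
                },
            }
        }
    }


def count_by_live_child(child_id):
    """Count body: documents containing the given child with is_deleted == false."""
    return {
        "query": {
            "nested": {
                "path": "parents.childs",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"parents.childs.child_id": child_id}},
                            {"term": {"parents.childs.is_deleted": False}},
                        ]
                    }
                },
            }
        }
    }


# Example IDs (42 and 7) are arbitrary placeholders.
print(json.dumps(count_by_parent_with_live_child(42), indent=2))
print(json.dumps(count_by_live_child(7), indent=2))
```

Both conditions of the second query sit inside the same nested clause so that they must match on the same child object rather than on two different children.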

Source

2016-12-30 Nishant Kumar

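A bulk of 100k documents sounds like a lot. Can you lower it and try again?

I tried 10k, but it did not improve much.

@AndreiStefan Index1 has 400 shards. Could that be the reason for the low performance? What insert rate should I expect?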

Your problem is probably not in the configuration but at the hardware level.

Try disabling throttling:

PUT /_cluster/settings
{
    "transient": {
        "indices.store.throttle.type": "none"
    }
}

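Turn off replicas (set them to 0).

Reduce the number of shards to at most 2-3 per node (400 is ridiculously dangerous).

Change the refresh interval to -1 while indexing: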

PUT /{INDICE}/_settings
{
    "index": {
        "refresh_interval": "-1"
    }
}
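
The replica and refresh settings above are usually changed only for the duration of a heavy load and restored afterwards. A minimal sketch of that toggle, using only the Python standard library; the host `http://localhost:9200` and the index name `index2` are placeholders, and the "normal" values assume the Elasticsearch default refresh interval of 1s plus the single replica described in the question.

```python
import json
import urllib.request


def settings_request(base_url, index, settings):
    """Build (but do not send) a PUT /{index}/_settings request."""
    body = json.dumps({"index": settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Settings for the heavy-indexing phase and for normal operation.
BULK_LOAD = {"refresh_interval": "-1", "number_of_replicas": 0}
NORMAL = {"refresh_interval": "1s", "number_of_replicas": 1}

before = settings_request("http://localhost:9200", "index2", BULK_LOAD)
after = settings_request("http://localhost:9200", "index2", NORMAL)

# To apply against a reachable cluster:
# urllib.request.urlopen(before)
# ... run the bulk load ...
# urllib.request.urlopen(after)
print(before.get_method(), before.full_url, before.data.decode("utf-8"))
```

Note that with the refresh interval at -1, newly indexed documents do not become searchable until a refresh happens, so for a continuous stream this is a trade-off rather than a free win.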

Load-balance the bulk requests across the servers (nodes).
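
One simple way to spread bulk requests over the nodes is client-side round-robin. The sketch below only shows the selection logic; the node URLs are hypothetical, and a real client would also need to skip nodes that are down.

```python
from itertools import cycle

# Hypothetical node addresses; replace with the real cluster nodes.
NODES = [
    "http://es-node-1:9200",
    "http://es-node-2:9200",
    "http://es-node-3:9200",
]

# cycle() yields the nodes in order and starts over when it reaches the end.
node_picker = cycle(NODES)

# Each bulk request takes the next node in turn.
targets = [next(node_picker) for _ in range(5)]
print(targets)
```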

Use persistent connections, for example via sockets.

Make sure you are not running into a network bottleneck.

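Regarding the 100k-document bulk requests: it depends on the size of each document, but the sweet spot is usually around 4-5k documents. Why? Because the bulk API does not insert the data immediately. It first buffers the request in memory and then dumps it to disk, so if you end up sending bulks that are too large to buffer, you will run into trouble.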
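
Splitting a large batch into smaller bulks can be sketched as below. The chunk size of 5,000 follows the rule of thumb above, and the index and type names passed to `bulk_body` are placeholders; the body format is the newline-delimited action/document pairs that the bulk API expects.

```python
import json
from itertools import islice


def chunked(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(batch, index, doc_type):
    """Build the newline-delimited body for one bulk request."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# 12,000 dummy documents shaped loosely like the mapping in the question.
docs = ({"other_ID": str(i), "parents": []} for i in range(12_000))
sizes = [len(batch) for batch in chunked(docs, 5_000)]
print(sizes)  # [5000, 5000, 2000]
```

The right chunk size still depends on document size, so it is worth measuring a few values rather than treating 5,000 as fixed.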

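If you are using a persistent connection, you do not need to worry as much about the size of your bulk requests: you can open a socket and start sending batches of documents, and it will go as fast as it can. (Because the connection handshake does not have to be repeated every time, this can save you around 50 ms per round trip.)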
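
A minimal sketch of reusing one HTTP connection for several bulk requests, using Python's standard `http.client`. The host, port, and the `/_bulk` path default are assumptions for the example; the bodies are expected to be ready-made newline-delimited bulk payloads.

```python
import http.client


def open_connection(host="localhost", port=9200):
    """Create a connection object; the socket is opened on first use
    and kept alive between requests."""
    return http.client.HTTPConnection(host, port, timeout=30)


def send_bulks(conn, bodies, path="/_bulk"):
    """Send each bulk body over the same connection and collect HTTP statuses."""
    statuses = []
    for body in bodies:
        conn.request(
            "POST",
            path,
            body=body.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
        )
        response = conn.getresponse()
        response.read()  # drain the response so the connection can be reused
        statuses.append(response.status)
    return statuses


# Usage against a reachable cluster:
# conn = open_connection()
# print(send_bulks(conn, [bulk_payload_1, bulk_payload_2]))
# conn.close()
```

Reading each response fully before issuing the next request is what allows the same connection to be reused.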

If you have any other questions, let me know. I know this is a bit late, but hopefully someone finds it useful at some point.

Source

2017-06-09 14:36:36
