2014-11-24 201 views

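While testing a CoreOS cluster with three nodes, and after successfully adding and removing a few additional nodes, I ran into the following problem, which is presumably caused by a race condition in the etcd election process. How can a race condition in etcd leader election be resolved?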

Checking for the new leader gives:

$ curl -L http://127.0.0.1:4001/v2/stats/leader 
{"errorCode":300,"message":"Raft Internal Error","index":629006} 

journalctl gives the following on every machine in the cluster:

$ journalctl -r -u etcd 
-- Logs begin at Wed 2014-11-12 15:09:01 UTC, end at Mon 2014-11-24 10:47:34 UTC. -- 
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.307 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: term #5221 started. 
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.306 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'. 
Nov 24 10:47:33 node-1 etcd[56576]: [etcd] Nov 24 10:47:33.098 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'. 
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: term #5219 started. 
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'. 
Nov 24 10:47:31 node-1 etcd[56576]: [etcd] Nov 24 10:47:31.962 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'. 
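
To make the election churn in this log easier to see, the term numbers can be pulled out and compared against the timestamps. This is only a rough sketch: the regular expression is written against the log lines shown above, and the function name `term_starts` is made up for illustration.

```python
import re
from datetime import datetime

# Three lines copied from the journalctl output above (newest first).
LOG = """\
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.307 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: term #5221 started.
Nov 24 10:47:33 node-1 etcd[56576]: [etcd] Nov 24 10:47:33.098 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO  | 965d12d38a4a4b2c807bd232fb7b0db7: term #5219 started.
"""

# Captures etcd's own millisecond timestamp and the term number.
TERM_RE = re.compile(
    r"\[etcd\] \w+ \d+ (\d{2}:\d{2}:\d{2}\.\d{3}) .*term #(\d+) started"
)


def term_starts(log: str):
    """Return (time, term) pairs for every 'term #N started' line, oldest first."""
    found = []
    for line in log.splitlines():
        m = TERM_RE.search(line)
        if m:
            when = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            found.append((when, int(m.group(2))))
    return sorted(found)


starts = term_starts(LOG)
(first_time, first_term), (last_time, last_term) = starts[0], starts[-1]
elapsed = (last_time - first_time).total_seconds()

# Number of terms advanced, and the seconds it took.
print(last_term - first_term, round(elapsed, 3))  # prints: 2 2.226
```

On these lines the term advances by two in a little over two seconds, which is consistent with nodes repeatedly timing out and starting elections without any candidate collecting a majority.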

Listing the machines with fleet also fails:

$ fleetctl list-machines 
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused 
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms 
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused 
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms 
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused 

Listing the machines in the cluster gives:

$ curl -L http://127.0.0.1:7001/v2/admin/machines 
[{"name":"","state":"follower","clientURL":"http://100.72.62.35:4001","peerURL":"http://100.72.62.35:7001"}, 
{"name":"555cca74216644fea48990673b3d539c","state":"follower","clientURL":"http://100.72.62.59:4001","peerURL":"http://100.72.62.59:7001"}, 
{"name":"965d12d38a4a4b2c807bd232fb7b0db7","state":"follower","clientURL":"http://100.72.20.153:4001","peerURL":"http://100.72.20.153:7001"}, 
{"name":"a1b566dedb194c259f7eb2ffde5595b1","state":"follower","clientURL":"http://100.72.62.2:4001","peerURL":"http://100.72.62.2:7001"}, 
{"name":"a45efba827754b5f93c38b751a0ae273","state":"follower","clientURL":"http://100.72.62.31:4001","peerURL":"http://100.72.62.31:7001"}, 
{"name":"d041738235a9483cb814d37ca7fa4b6d","state":"follower","clientURL":"http://100.72.20.18:4001","peerURL":"http://100.72.20.18:7001"}] 

But only three machines are currently running. I tried adding extra machines to reach quorum, to no avail. I am running the following version:

$ etcdctl -v 
etcdctl version 0.4.6 

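As mentioned at https://coreos.com/docs/distributed-configuration/etcd-api/#cluster-config the leader module, which could force a leader, has been removed. The ugly part is that, since there is no quorum, I cannot remove the machines that are not currently running from the machine list, for example with: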

$ curl -L -XDELETE http://127.0.0.1:7001/v2/admin/machines/2abbf47a9e644bc69652a986d796d7a6 

It has no effect. Is there any way to save the cluster?

Answer


As I understand it, you can save the cluster, but it is not worth it.

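The cluster does not accept new machines because adding a machine requires a quorum, and the existing machines do not form one. The same applies to removing machines and to deleting keys.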

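If you can start enough of the machines that are listed as cluster members and get them working as members again, you will have a quorum and the cluster is saved.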

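From what I can see, six machines are listed as cluster members, so at least four of those existing members need to be running for the cluster to work.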
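
The arithmetic behind that number can be sketched as follows. This is a minimal illustration, assuming the usual Raft rule that a quorum is a strict majority of the registered members; the helper names `quorum` and `missing_for_quorum` are made up for this example and are not part of etcd.

```python
import json

# Member list as returned by /v2/admin/machines in the question,
# reduced to the "name" field for brevity.
MACHINES_JSON = """
[{"name": ""},
 {"name": "555cca74216644fea48990673b3d539c"},
 {"name": "965d12d38a4a4b2c807bd232fb7b0db7"},
 {"name": "a1b566dedb194c259f7eb2ffde5595b1"},
 {"name": "a45efba827754b5f93c38b751a0ae273"},
 {"name": "d041738235a9483cb814d37ca7fa4b6d"}]
"""


def quorum(members: int) -> int:
    """Smallest strict majority of `members` (assumed Raft majority rule)."""
    return members // 2 + 1


def missing_for_quorum(members: int, running: int) -> int:
    """How many more registered members must come back to reach quorum."""
    return max(0, quorum(members) - running)


registered = len(json.loads(MACHINES_JSON))
running = 3  # the question states that only three machines are up

# Registered members, quorum size, members still missing.
print(registered, quorum(registered), missing_for_quorum(registered, running))
# prints: 6 4 1
```

With six registered members the quorum is four, so three running machines are one short. The machine that closes the gap has to be one of the already registered members, since adding a new member itself requires a quorum.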