2017-02-09

Unable to re-register Mesos agent: mesos-slave cannot re-register after updating attributes (isolation)

6868 status_update_manager.cpp:177] Pausing sending status updates 
6877 slave.cpp:915] New master detected at [email protected]:5050 
6867 status_update_manager.cpp:177] Pausing sending status updates 
6877 slave.cpp:936] No credentials provided. Attempting to register without authentication 
6877 slave.cpp:947] Detecting new master 
6869 slave.cpp:1217] Re-registered with master [email protected]:5050 
6866 status_update_manager.cpp:184] Resuming sending status updates 
6869 slave.cpp:1253] Forwarding total oversubscribed resources {} 
6874 slave.cpp:4141] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration. 
6874 slave.cpp:904] Re-detecting master 
6874 slave.cpp:947] Detecting new master 
6874 status_update_manager.cpp:177] Pausing sending status updates 
6869 status_update_manager.cpp:177] Pausing sending status updates 
6871 slave.cpp:915] New master detected at [email protected]:5050 
6871 slave.cpp:936] No credentials provided. Attempting to register without authentication 
6871 slave.cpp:947] Detecting new master 
6872 slave.cpp:1217] Re-registered with master [email protected]:5050 
6872 slave.cpp:1253] Forwarding total oversubscribed resources {} 
6871 status_update_manager.cpp:184] Resuming sending status updates 
6871 slave.cpp:4141] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration. 

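The agent appears to be stuck in an infinite loop. Any ideas on how to start with a fresh slave? I tried deleting the work_dir and restarting the mesos-slave process, but without success.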
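One way to confirm the loop, rather than eyeballing the log, is to count how often the agent forces a re-registration. The following is only a minimal sketch: the function name `count_forced_reregistrations` is made up for illustration, and it simply wraps `grep` over a log read from stdin.

```shell
# Count the "Forcing re-registration" lines in an agent log read from stdin.
# The function name is hypothetical; the matched text is taken from the
# slave.cpp:4141 message in the log excerpt above.
count_forced_reregistrations() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function from failing in that case.
    grep -c 'Forcing re-registration' || true
}
```

It could be run as `count_forced_reregistrations < /var/log/mesos/mesos-slave.INFO`, where the log path is an assumption that depends on how the agent was installed. A count that keeps growing between runs suggests the agent is still looping.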

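The situation was caused by accidentally renaming the work_dir. After mesos-slave was restarted, it could neither reconnect nor kill the running tasks. I tried using cleanup on the slave: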

echo 'cleanup' > /etc/mesos-slave/recover 
service mesos-slave restart 
# after recovery finishes 
rm /etc/mesos-slave/recover 
service mesos-slave restart 

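This helped partially, but there are still many zombie tasks in Marathon, because the Mesos master is unable to retrieve any information about those tasks. When I looked at the metrics, I found that some slaves were marked as "inactive".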
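The inactive count can be read from the master's metrics snapshot. The sketch below assumes the flat `{"name":value,...}` JSON that the master's `/metrics/snapshot` endpoint returns and the `master/slaves_inactive` metric; the function name `inactive_agents` is hypothetical, and the parsing is deliberately simple (a JSON tool such as `jq` would be more robust where available).

```shell
# Print the value of master/slaves_inactive from a metrics snapshot on stdin.
# Assumes flat JSON of the form {"name":value,...}; the function name is
# hypothetical.
inactive_agents() {
    # Put each "name":value pair on its own line, then print only the
    # numeric value of the metric of interest.
    tr ',{}' '\n\n\n' |
        sed -n 's|^ *"master/slaves_inactive" *: *\([0-9.]*\).*$|\1|p'
}
```

A possible invocation is `curl -s http://<master-host>:5050/metrics/snapshot | inactive_agents`, where `<master-host>` is a placeholder for the address of the current master.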

UPDATE: the following appears in the master logs:

Cannot kill task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c 
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon) at 
[email protected]:52192 
because the agent cac09818-0d75-46a9-acb1-4e17fdb9e328-S10 at 
slave(1)@192.168.1.1:5051 (w10.example.net) is disconnected. 
Kill will be retried if the agent re-registers 

After restarting the current mesos-master:

Cannot kill task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c 
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon) 
at [email protected]:39972 
because it is unknown; performing reconciliation 

Performing explicit task state reconciliation for 1 tasks 
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon) 
at [email protected]:39972 

Dropping reconciliation of task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c 
for framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon) 
at [email protected]:39972 
because there are transitional agents 

Could you attach the master logs? – janisz


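I couldn't find anything relevant in the master logs. It looks like Mesos marked the old slaves as inactive and is still waiting for them to recover. – Tombart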

Answer


The split-brain situation was caused by having more than one work_dir. In most cases it should be enough to move the data out of the incorrect work_dir:

mv /tmp/mesos/slaves/* /var/lib/mesos/slaves/ 

Then force re-registration:

rm -rf /var/lib/mesos/meta/slaves/latest 
service mesos-slave restart 

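Tasks that are currently running won't survive (they will not be recovered). Tasks from the old executors should be marked as TASK_LOST and scheduled for cleanup. This avoids the problem of zombie tasks that Mesos cannot kill (because they are running under a different work_dir).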

If the mesos-slave is still registered as inactive, restart the current Mesos master.
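Before moving any data, it may help to check which directories actually hold agent state. The sketch below treats a directory as a work_dir if it contains a `meta/slaves` subdirectory, which matches the `/var/lib/mesos/meta/slaves/latest` path used above; the function name `list_agent_work_dirs` is hypothetical.

```shell
# Print each candidate directory that contains agent checkpoint metadata
# (a meta/slaves subdirectory). More than one line of output suggests the
# agent has been started with more than one work_dir.
# The function name is hypothetical.
list_agent_work_dirs() {
    for dir in "$@"; do
        if [ -d "$dir/meta/slaves" ]; then
            printf '%s\n' "$dir"
        fi
    done
}
```

With the two locations from this answer, the call would be `list_agent_work_dirs /tmp/mesos /var/lib/mesos`.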