2012-08-08 271 views
1

I noticed that the reducer is stuck because of a dead node. The log shows many retry messages. Is it possible to tell the jobtracker to give up on the dead node and resume the work? There are 323 mappers and only 1 reducer. I am on hadoop-1.0.3.

2012-08-08 11:52:19,903 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 65 seconds. 
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Need another 63 map output(s) where 0 is already in progress 
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts) 
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 
2012-08-08 11:53:19,905 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 5 seconds. 
2012-08-08 11:53:29,906 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts) 
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 copy failed: attempt_201207191440_0203_m_000001_0 from 192.168.1.23 
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: java.net.NoRouteToHostException: No route to host 
    at java.net.PlainSocketImpl.socketConnect(Native Method) 
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327) 
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193) 
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180) 
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384) 
    at java.net.Socket.connect(Socket.java:546) 
    at sun.net.NetworkClient.doConnect(NetworkClient.java:173) 
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:409) 
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:530) 
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:240) 
    at sun.net.www.http.HttpClient.New(HttpClient.java:321) 
    at sun.net.www.http.HttpClient.New(HttpClient.java:338) 
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935) 
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876) 
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801) 
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618) 
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575) 
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483) 
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394) 
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326) 

2012-08-08 11:53:47,907 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201207191440_0203_r_000000_0: Failed fetch #18 from attempt_201207191440_0203_m_000001_0 
2012-08-08 11:53:47,907 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 adding host 192.168.1.23 to penalty box, next contact in 1124 seconds 
2012-08-08 11:53:47,907 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0: Got 1 map-outputs from previous failures 
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Need another 63 map output(s) where 0 is already in progress 
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201207191440_0203_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts) 
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 
2012-08-08 11:54:22,909 INFO org.apache.hadoop.mapred.ReduceTask: 192.168.1.23 Will be considered after: 1089 seconds. 

I left it alone; it retried for a while, then gave up on the dead host, re-ran the mappers, and succeeded. The problem was caused by the host having two IPs: I intentionally shut one of them down, and that was the IP Hadoop was using.

My question is: is there a way to tell Hadoop to give up on the dead host without retrying?

Answer

3

From your log, one of the tasktrackers that ran map tasks cannot be connected to. The tasktracker running the reducer tries to retrieve the intermediate map results over HTTP, and this fails because the tasktracker holding the results is dead.

The default behavior on a tasktracker failure is this:

The JobTracker reschedules map tasks that ran and completed successfully on the failed tasktracker, and re-runs them if they belong to an incomplete job, because their intermediate output resides on the failed tasktracker's local filesystem and may be inaccessible to the reduce tasks. Any tasks that were still in progress are also rescheduled.

The problem is that if a task (whether map or reduce) fails too many times (4, I believe), it is not rescheduled again and the job fails. In your case the maps seem to have completed successfully, but the reducer could not connect to the mapper to retrieve the intermediate results. It tried 4 times and failed after that.
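For reference, the per-task attempt limits mentioned above are configurable in Hadoop 1.x through `mapred-site.xml`. A minimal sketch (the values shown are the commonly cited defaults; verify them against your version's documentation):

    <!-- conf/mapred-site.xml: maximum attempts per task before the job fails -->
    <property>
      <name>mapred.map.max.attempts</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.reduce.max.attempts</name>
      <value>4</value>
    </property>

Raising these values buys more retries but only delays failure if the host never comes back.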

A task failure cannot be ignored entirely, because the task is part of the job, and the job itself does not succeed unless all of its tasks succeed.

Try to find the URL the reducer is attempting to fetch and paste it into a browser to see what error you get.
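In Hadoop 1.x the reducer fetches map output from the TaskTracker's embedded HTTP server, so the URL to test has roughly this shape (50060 is the default TaskTracker HTTP port; the exact query parameters here are from memory and may differ in your version):

    http://192.168.1.23:50060/mapOutput?job=job_201207191440_0203&map=attempt_201207191440_0203_m_000001_0&reduce=0

If the host is truly unreachable, the browser will show a connection error similar to the `NoRouteToHostException` in your log.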

You can also blacklist a node and exclude it completely from the list of nodes Hadoop uses:

In conf/mapred-site.xml:

    <property>
      <name>mapred.hosts.exclude</name>
      <value>/full/path/of/host/exclude/file</value>
    </property>

To reconfigure the nodes:

    /bin/hadoop mradmin -refreshNodes
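The exclude workflow above, sketched end to end (the exclude-file path and the hostname are placeholders for your cluster; the file must match the path set in `mapred.hosts.exclude`):

    # Add the dead host to the exclude file (path is an example -- use your own)
    echo "192.168.1.23" >> /etc/hadoop/conf/mapred.exclude

    # Tell the JobTracker to re-read the exclude file (Hadoop 1.x)
    /bin/hadoop mradmin -refreshNodes

After the refresh, the JobTracker stops assigning tasks to the excluded host; already-failed fetches are still subject to the retry limits described above.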
+0

Thanks! In my case I left it alone; it retried for a while, then gave up on the dead host, re-ran the mappers, and succeeded. This was caused by the host having two IP addresses; I intentionally shut one of them down, and that was the IP Hadoop was using. My question is whether there is a way to tell Hadoop to give up on the dead host without retrying. – 2012-08-13 07:01:06

+0

Maybe the edit helps – Razvan 2012-08-13 11:40:16

+0

If this really is Hadoop's intended behavior, it is deeply unsatisfying. Hardware fails all the time. Hadoop was designed to be resilient to hardware failures. When a job fails because of a limited hardware failure, that points to a design flaw in Hadoop. – jhclark 2013-10-04 17:24:34