I am using Hadoop 2.2.0 and HA automatic failover is not working: the active NN keeps crashing. Here is my configuration.

core-site.xml

<property> 
    <name>ha.zookeeper.quorum</name> 
    <value>zk01.bi.lietou.inc:2181,zk02.bi.lietou.inc:2181,zk03.bi.lietou.inc:2181</value> 
</property> 
<property> 
    <name>ipc.client.connect.timeout</name> 
    <value>20000</value> 
</property> 

hdfs-site.xml

<property> 
    <name>dfs.nameservices</name> 
    <value>lynxcluster</value> 
</property> 
<property> 
    <name>dfs.ha.namenodes.lynxcluster</name> 
    <value>nn1,nn2</value> 
</property> 
<property> 
    <name>dfs.namenode.rpc-address.lynxcluster.nn1</name> 
    <value>192.168.30.133:2020</value> 
</property> 
<property> 
    <name>dfs.namenode.rpc-address.lynxcluster.nn2</name> 
    <value>192.168.30.129:2020</value> 
</property> 
<property> 
    <name>dfs.namenode.http-address.lynxcluster.nn1</name> 
    <value>192.168.30.133:2070</value> 
</property> 
<property> 
    <name>dfs.namenode.http-address.lynxcluster.nn2</name> 
    <value>192.168.30.129:2070</value> 
</property> 
<property> 
    <name>dfs.namenode.shared.edits.dir</name> 
    <value>qjournal://192.168.30.134:8485;192.168.30.135:8485;192.168.30.136:8485/mycluster</value> 
</property> 
<property> 
    <name>dfs.client.failover.proxy.provider.lynxcluster</name> 
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> 
</property> 
<property> 
    <name>dfs.ha.automatic-failover.enabled</name> 
    <value>true</value> 
</property> 
<property> 
    <name>dfs.qjournal.write-txns.timeout.ms</name> 
    <value>6000000</value> 
</property> 

192.168.30.129 is the active NN, and 192.168.30.133 is the standby NN.

The active NN hit an error while starting a log segment:

2015-09-26 22:09:06,044 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.30.133 
2015-09-26 22:09:06,044 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs 
2015-09-26 22:09:06,044 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 522618707 
2015-09-26 22:09:06,179 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 16 Total time for transactions(ms): 7 Number of transactions batched in Syncs: 0 Number of syncs: 10 SyncTimes(ms): 670 2033 142 
2015-09-26 22:09:06,185 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data1/hadoop/name/current/edits_inprogress_0000000000522618707 -> /data1/hadoop/name/current/edits_0000000000522618707-0000000000522618722 
2015-09-26 22:09:06,294 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data2/hadoop/name/current/edits_inprogress_0000000000522618707 -> /data2/hadoop/name/current/edits_0000000000522618707-0000000000522618722 
2015-09-26 22:09:06,307 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 522618723 
2015-09-26 22:09:12,308 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:13,310 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:14,310 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:15,312 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9004 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:16,312 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10005 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:17,313 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11006 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:18,314 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12007 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:19,315 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13008 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:20,317 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14009 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:21,317 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15010 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:22,319 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16011 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:23,319 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17012 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:24,321 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18013 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:25,321 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19014 ms (timeout=20000 ms) for a response for startLogSegment(522618723). Succeeded so far: [192.168.30.134:8485] 
2015-09-26 22:09:26,308 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 522618723 failed for required journal (JournalAndStream(mgr=QJM to [192.168.30.134:8485, 192.168.30.135:8485, 192.168.30.136:8485], stream=null)) 
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. 
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137) 
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:387) 
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:91) 
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$2.apply(JournalSet.java:199) 
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:352) 
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:196) 
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1029) 
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:998) 
    at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1082) 
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5050) 
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:832) 
    at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139) 
    at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214) 
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) 
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) 
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048) 
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:415) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) 
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042) 
2015-09-26 22:09:26,312 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 
2015-09-26 22:09:26,319 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down NameNode at lynx001-bi-30-129.liepin.inc/192.168.30.129 
************************************************************/ 

After 129 went down, 133 remained in standby. Standby NN log:

2015-09-26 22:09:27,651 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN 
java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "lynx-bi-30-133.liepin.inc/192.168.30.133"; destination host is: "lynx001-bi-30-129.liepin.inc":2020; 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1351) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1300) 
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) 
    at com.sun.proxy.$Proxy11.rollEditLog(Unknown Source) 
    at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:139) 
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:268) 
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61) 
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:310) 
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279) 
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296) 
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456) 
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292) 
Caused by: java.io.EOFException 
    at java.io.DataInputStream.readInt(DataInputStream.java:392) 
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:995) 
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:891) 

Right before this error, it started spewing:

2015-09-26 22:03:00,941 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby 
2015-09-26 22:03:00,941 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 2020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 192.168.30.131:35882 Call#7495335 Retry#0: error: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby 
2015-09-26 22:03:01,135 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby 
2015-09-26 22:03:01,135 INFO org.apache.hadoop.ipc.Server: IPC Server handler 45 on 2020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 192.168.30.131:35886 Call#7495346 Retry#0: error: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby 
2015-09-26 22:03:06,050 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby 
2015-09-26 22:03:06,050 INFO org.apache.hadoop.ipc.Server: IPC Server handler 19 on 2020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 192.168.30.131:35891 Call#1 Retry#0: error: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby 

JournalNode (JN) log:

2015-09-26 22:09:44,395 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.startLogSegment from 192.168.30.129:30015 Call#157803 Retry#0: output error 
2015-09-26 22:09:45,400 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8485 caught an exception 
java.nio.channels.ClosedChannelException 
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265) 
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474) 
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2577) 
    at org.apache.hadoop.ipc.Server.access$2200(Server.java:122) 
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1011) 
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1076) 
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2104) 

I tried increasing the IPC timeout to 60 seconds, but it did not help.


Did you find a solution? – Mouette

Answer


I believe it uses dfs.qjournal.start-segment.timeout.ms, which defaults to 20000.

But there are other configs as well that you may need to tune, such as dfs.qjournal.write-txns.timeout.ms.

However, you are better off fixing the underlying infrastructure problem rather than changing these defaults. There seem to be quite a few properties that define how the NameNodes manage their various kinds of connections, as well as the JournalManagers' timeouts.

In my case, I added the following custom properties to hdfs-site.xml:

dfs.qjournal.start-segment.timeout.ms = 90000 
dfs.qjournal.select-input-streams.timeout.ms = 90000 
dfs.qjournal.write-txns.timeout.ms = 90000
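
For reference, here is a sketch of how these could be written as <property> entries in hdfs-site.xml, following the same format as the configuration in the question (property names and values are the ones listed above; tune them for your environment):

<!-- assumed placement: hdfs-site.xml, values taken from the list above --> 
<property> 
    <name>dfs.qjournal.start-segment.timeout.ms</name> 
    <value>90000</value> 
</property> 
<property> 
    <name>dfs.qjournal.select-input-streams.timeout.ms</name> 
    <value>90000</value> 
</property> 
<property> 
    <name>dfs.qjournal.write-txns.timeout.ms</name> 
    <value>90000</value> 
</property> 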

I also added the following property to core-site.xml:

ipc.client.connect.timeout = 90000

So far, this seems to have alleviated the problem.
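
Expressed in the same format as the question's core-site.xml, that change is a single property entry (a sketch, using the value above in place of the original 20000):

<!-- assumed placement: core-site.xml, raised from the question's 20000 --> 
<property> 
    <name>ipc.client.connect.timeout</name> 
    <value>90000</value> 
</property> 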