2015-06-22 109 views
4

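Oozie job stuck at START action in PREP state

I have an Oozie job that I start from a Java client, and it gets stuck at the START action: the job reports status RUNNING, but the START node is in PREP state.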

Why is this happening, and how can I fix it?

The Oozie workflow contains only a single java action. The cluster runs Hadoop 2.4.0 and Oozie 4.0.0.

Here is the workflow.xml:

<workflow-app xmlns='uri:oozie:workflow:0.2' name='java-filecopy-wf'>
    <start to='java1'/>
    <action name='java1'>
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>testingoozieclient.Client</main-class>
            <capture-output/>
        </java>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>

Here is the Java client:

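import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class Client {
    public static void main(String[] args) {
        // args[0]: Oozie server URL
        OozieClient oozieClient = new OozieClient(args[0]);

        Properties conf = oozieClient.createConfiguration();
        // args[1]: path of the workflow application
        conf.setProperty(OozieClient.APP_PATH, args[1]);

        // Values substituted for ${nameNode} and ${jobTracker} in workflow.xml
        conf.setProperty("nameNode", args[2]);
        conf.setProperty("jobTracker", args[3]);

        String jobId = null;

        try {
            // Submit and start the workflow job
            jobId = oozieClient.run(conf);
        } catch (OozieClientException ex) {
            Logger.getLogger(Client.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}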

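Because I tried this several times, there are now 5 or 6 workflows that all have status RUNNING, but when I look at them in the web interface I can see that they are all stuck at the START node in PREP state.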


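After some of the submitted workflows were killed, I was able to start another workflow. This time the workflow moved from start to the java action, but then got stuck in the java action in a similar way: it stays in PREP state.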

Here is what the log looks like:

2015-06-22 17:54:37,366 INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] Start action [0000030-150619153616589-oozie-oozi-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:37,367 WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action status=DONE
2015-06-22 17:54:37,367 WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action updated in DB!
2015-06-22 17:54:37,426 INFO ActionEndXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] end executor for wf action 0000030-150619153616589-oozie-oozi-W with wf job 0000030-150619153616589-oozie-oozi-W
2015-06-22 17:54:37,676 INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Start action [0000030-150619153616589-oozie-oozi-W@java1] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:38,316 INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master:8020
2015-06-22 17:54:38,501 WARN JavaActionExecutor:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] credentials is null for the action
2015-06-22 17:54:38,640 INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master:8020


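This morning I found that the job status was SUSPENDED. The START node was OK, but the java node was in START_RETRY with the following error: JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused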

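I should stress that Oozie runs on the same machine as the ResourceManager, so it is strange that it tries to start the workflow on that same machine and yet reports a failed connection.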

Here is the Oozie job log:

2015-06-22 17:54:37,366 INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] Start action [0000030-150619153616589-oozie-oozi-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:37,367 WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action status=DONE
2015-06-22 17:54:37,367 WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] [***0000030-150619153616589-oozie-oozi-W@:start:***]Action updated in DB!
2015-06-22 17:54:37,426 INFO ActionEndXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@:start:] end executor for wf action 0000030-150619153616589-oozie-oozi-W with wf job 0000030-150619153616589-oozie-oozi-W
2015-06-22 17:54:37,676 INFO ActionStartXCommand:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Start action [0000030-150619153616589-oozie-oozi-W@java1] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2015-06-22 17:54:38,316 INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master01.novalocal:8020
2015-06-22 17:54:38,501 WARN JavaActionExecutor:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] credentials is null for the action
2015-06-22 17:54:38,640 INFO JavaActionExecutor:539 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] addShareLib: using FileSystem hdfs://master01.novalocal:8020
2015-06-22 20:05:33,340 WARN ActionStartXCommand:542 - USER[hadoop] GROUP[-] TOKEN[] APP[java-filecopy-wf] JOB[0000030-150619153616589-oozie-oozi-W] ACTION[0000030-150619153616589-oozie-oozi-W@java1] Error starting action [java1]. ErrorType [TRANSIENT], ErrorCode [ JA006], Message [ JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
org.apache.oozie.action.ActionExecutorException: JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused 
    at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:412) 
    at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:392) 
    at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:837) 
    at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:988) 
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:215) 
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:60) 
    at org.apache.oozie.command.XCommand.call(XCommand.java:280) 
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:326) 
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:255) 
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    at java.lang.Thread.run(Thread.java:744) 
Caused by: java.net.ConnectException: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused 
    at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown Source) 
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1414) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1363) 
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) 
    at com.sun.proxy.$Proxy42.getDelegationToken(Unknown Source) 
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getDelegationToken(ApplicationClientProtocolPBClientImpl.java:282) 
    at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:606) 
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) 
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) 
    at com.sun.proxy.$Proxy43.getDelegationToken(Unknown Source) 
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getRMDelegationToken(YarnClientImpl.java:452) 
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getDelegationToken(ResourceMgrDelegate.java:166) 
    at org.apache.hadoop.mapred.YARNRunner.getDelegationToken(YARNRunner.java:220) 
    at org.apache.hadoop.mapreduce.Cluster.getDelegationToken(Cluster.java:400) 
    at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1203) 
    at org.apache.hadoop.mapred.JobClient$16.run(JobClient.java:1200) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:415) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) 
    at org.apache.hadoop.mapred.JobClient.getDelegationToken(JobClient.java:1199) 
    at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:377) 
    at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1031) 
    at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:786) 
    ... 10 more 
Caused by: java.net.ConnectException: Connection refused 
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735) 
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) 
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) 
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) 
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604) 
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699) 
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367) 
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1381) 
    ... 33 more 
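
For anyone reading a trace like this, the useful part is the "Call From ... to host:port" fragment, because it names the endpoint that refused the connection (here master02.novalocal:8032). A minimal sketch of pulling that endpoint out of a log line follows. The class name RefusedEndpoint and its parse method are made up for illustration and are not part of Oozie or Hadoop, and the pattern only assumes the message wording shown in the log above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the refused endpoint from a Hadoop "Call From ... to ..." message. */
class RefusedEndpoint {
    // Assumes the wording seen in the log above:
    // "Call From <client> to <host>:<port> failed on connection exception"
    private static final Pattern CALL_FAILED = Pattern.compile(
            "Call From (\\S+) to ([^\\s:]+):(\\d{1,5}) failed on connection exception");

    final String client; // caller as printed in the message, e.g. hostname/ip
    final String host;   // host that refused the connection
    final int port;      // port that refused the connection

    private RefusedEndpoint(String client, String host, int port) {
        this.client = client;
        this.host = host;
        this.port = port;
    }

    /** Returns the endpoint named in the message, or empty if the line has no such message. */
    static Optional<RefusedEndpoint> parse(String logLine) {
        Matcher m = CALL_FAILED.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new RefusedEndpoint(m.group(1), m.group(2), Integer.parseInt(m.group(3))));
    }

    public static void main(String[] args) {
        String line = "JA006: Call From master02.novalocal/192.168.111.52 to master02.novalocal:8032 "
                + "failed on connection exception: java.net.ConnectException: Connection refused";
        parse(line).ifPresent(e ->
                System.out.println(e.client + " could not reach " + e.host + ":" + e.port));
    }
}
```

The extracted host and port are the ones worth comparing against the jobTracker value passed to the workflow.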
+0

Could you add the error log reported in Oozie? – karthik

+0

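I will, but the strangest thing is that I don't get any errors. When I look at the Job Log in the web interface, it is completely empty. Where should I look for errors? – Marko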

+0

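Go to the job logs and get the job log, or try the history server. I think Oozie successfully handed the job over to Hadoop. Please check the YARN logs in the history server. – karthik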

Answers

0

I bet your MapReduce cluster has run out of slots. Check how many map slots are configured.

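Also try to find out whether a service is listening on port 8032. You can use the command sudo netstat -netulp | grep 8032. If no output is returned, the service is down. You can also check connectivity with nmap or telnet.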
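
If netstat, nmap, or telnet are not available on the machine, the same connectivity check can be sketched in plain Java. This is only an illustration: PortProbe and isPortOpen are hypothetical names, the 2-second timeout is an arbitrary choice, and the default host and port are simply the ones from the error message in the question.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether something accepts TCP connections on host:port. */
class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unresolvable host: all reported as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are taken from the error message in the question; override them on the command line.
        String host = args.length > 0 ? args[0] : "master02.novalocal";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8032;
        String verdict = isPortOpen(host, port, 2000) ? " is accepting connections" : " is not reachable";
        System.out.println(host + ":" + port + verdict);
    }
}
```

Run it on the Oozie host and with the exact host name that appears in the error, since a service bound to a different interface can be reachable under one name and refused under another.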

+0

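Thanks for your answer, but is that possible if I am not running any MR jobs? I am just launching a simple Java class that prints some text, only to check whether my client works. – Marko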

+0

Oozie runs its jobs on the MapReduce cluster, so first make sure that your MapReduce cluster is up and running and has enough map slots (at least two to run one job). –

+0

I edited my question, please take a look. – Marko

1

Please check the ports in job.properties. This problem is usually related to the NameNode and JobTracker ports. Make sure the JobTracker port in your job.properties file is correct.

1

The main reason an Oozie job gets stuck in PREP state (and eventually goes to START_MANUAL state) is a misconfigured Hadoop service port.

nameNode=hdfs://localhost:9000 
jobTracker=10.71.71.15:8032 

If you are running YARN, the default port for the JobTracker is the same as the ResourceManager's port.

Also, try to fix any other port problems, such as the JobHistoryServer's port (as stated in the Oozie error message).
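
As a rough illustration of the check described in this answer, the sketch below looks at the nameNode and jobTracker properties and reports anything that does not look right. JobPropertiesCheck and findProblems are hypothetical names, and the expected ResourceManager port is a parameter: 8032 is the stock default for yarn.resourcemanager.address, but a cluster may be configured differently, so treat that value as an assumption to verify against yarn-site.xml.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper: sanity-checks the nameNode and jobTracker values passed to a workflow. */
class JobPropertiesCheck {
    /** Returns human-readable problems; an empty list means nothing suspicious was found. */
    static List<String> findProblems(Properties props, int expectedRmPort) {
        List<String> problems = new ArrayList<>();

        // nameNode is expected to be a full URI such as hdfs://host:port
        String nameNode = props.getProperty("nameNode", "").trim();
        try {
            if (new URI(nameNode).getScheme() == null) {
                problems.add("nameNode should be a full URI such as hdfs://host:port, got: '" + nameNode + "'");
            }
        } catch (URISyntaxException e) {
            problems.add("nameNode is not a valid URI: '" + nameNode + "'");
        }

        // jobTracker is expected to be host:port, pointing at the ResourceManager on YARN
        String jobTracker = props.getProperty("jobTracker", "").trim();
        int colon = jobTracker.lastIndexOf(':');
        if (colon <= 0 || colon == jobTracker.length() - 1) {
            problems.add("jobTracker should look like host:port, got: '" + jobTracker + "'");
        } else {
            try {
                int port = Integer.parseInt(jobTracker.substring(colon + 1));
                if (port != expectedRmPort) {
                    problems.add("jobTracker port is " + port + " but the ResourceManager is expected on " + expectedRmPort);
                }
            } catch (NumberFormatException e) {
                problems.add("jobTracker port is not a number: '" + jobTracker + "'");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("nameNode", "hdfs://localhost:9000");
        props.setProperty("jobTracker", "10.71.71.15:8032");
        List<String> problems = findProblems(props, 8032); // 8032: assumed ResourceManager port
        System.out.println(problems.isEmpty() ? "No obvious port problems" : String.join("\n", problems));
    }
}
```

This only catches malformed or mismatched values. Whether the service is actually listening on that port still has to be confirmed on the cluster itself.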