My job failed with the log below, but I don't fully understand it. It seems the PySpark job on Google Dataproc failed because of:

YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 24.7 GB of 24 GB physical memory used

But how can I increase the memory in Google Dataproc?

Log:

16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 332.0 in stage 0.0 (TID 332, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 335.0 in stage 0.0 (TID 335, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 329.0 in stage 0.0 (TID 329, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 
Traceback (most recent call last): 
    File "/tmp/5d6059b8-f9f4-4be6-9005-76c29a27af17/fetch.py", line 127, in <module> 
    main() 
    File "/tmp/5d6059b8-f9f4-4be6-9005-76c29a27af17/fetch.py", line 121, in main 
    d.saveAsTextFile('gs://ll_hang/decahose-hashtags/data-multi3') 
    File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1506, in saveAsTextFile 
    File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ 
    File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o50.saveAsTextFile. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 191 in stage 0.0 failed 4 times, most recent failure: Lost task 191.3 in stage 0.0 (TID 483, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 
Driver stacktrace: 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) 
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at scala.Option.foreach(Option.scala:236) 
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1213) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1156) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1060) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:952) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:952) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:952) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:951) 
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1457) 
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1436) 
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1436) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1436) 
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:507) 
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:46) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:209) 
    at java.lang.Thread.run(Thread.java:745) 

16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 280.1 in stage 0.0 (TID 475, cluster-4-w-3.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 283.1 in stage 0.0 (TID 474, cluster-4-w-67.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10, cluster-4-w-95.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, cluster-4-w-95.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 184.1 in stage 0.0 (TID 463, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 81.0 in stage 0.0 (TID 81, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 85.0 in stage 0.0 (TID 85, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 84.0 in stage 0.0 (TID 84, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,[email protected],null) 
16/05/05 01:12:42 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 438.1 in stage 0.0 (TID 442, cluster-4-w-23.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,[email protected],null) 
16/05/05 01:12:42 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 114 idle 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 97.0 in stage 0.0 (TID 97, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 102.0 in stage 0.0 (TID 102, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,[email protected],null) 
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,[email protected],null) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 190.1 in stage 0.0 (TID 461, cluster-4-w-67.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 111.0 in stage 0.0 (TID 111, cluster-4-w-74.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 101.0 in stage 0.0 (TID 101, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally) 
16/05/05 01:12:42 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message. 
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped. 
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161) 
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131) 
    at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:578) 
    at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170) 
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104) 
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) 
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) 
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) 
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) 
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) 
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) 
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) 
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) 
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) 
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) 
    at java.lang.Thread.run(Thread.java:745) 
16/05/05 01:12:42 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 

Answer


在Dataproc,火花被配置爲每包裝機的一半,其中的執行器然後並行運行多個任務取決於多少核處於半機器可用1個執行程序。例如,在n1-standard-4上,您希望每個執行程序都使用2個內核,從而一次同時運行兩個任務。內存同樣刻上去的,雖然有些記憶也預留給後臺服務,有的給YARN執行開銷等

This means that, in general, you have a few options for increasing per-task memory:

  1. You can decrease spark.executor.cores by 1 at a time, the minimum being 1; since this leaves spark.executor.memory unchanged, each parallel task effectively gets to share a larger portion of the per-executor memory. For example, on an n1-standard-8 the default is spark.executor.cores=4 with executor memory of roughly 12 GB, so each "task" gets to use ~3 GB of memory. If you set spark.executor.cores=3, executor memory stays at 12 GB and each task now gets ~4 GB. You could try going all the way down to spark.executor.cores=1 at first just to see whether this approach works at all, then increase it again, as long as the job still succeeds, to keep CPU utilization good. You can do this at job-submission time:

    gcloud dataproc jobs submit pyspark --properties spark.executor.cores=1 ... 
    
  2. Alternatively, you can boost spark.executor.memory, also at job-submission time; just take a look at your cluster's resources with gcloud dataproc clusters describe cluster-4 and you should see the current settings (a sketch of the command form follows this list).

  3. If you don't want to waste cores, you may want to try a different machine type. For example, if you are currently using n1-standard-8, switch to n1-highmem-8 instead. Dataproc still gives each executor half a machine, so you end up with more memory per executor. You can also use custom machine types to fine-tune the memory-to-CPU balance (again, see the sketch after this list).
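
For reference, the command forms for options 2 and 3 look roughly like the following. The property values and the cluster name my-highmem-cluster are placeholders, not taken from this job; pick real values after checking the describe output, and keep the executor heap plus YARN overhead within what YARN can allocate on a worker:

    # Option 2: check the current settings, then raise the executor heap and/or
    # the YARN overhead (which the error message itself suggests) at submission time
    gcloud dataproc clusters describe cluster-4
    gcloud dataproc jobs submit pyspark --properties spark.executor.memory=<heap>,spark.yarn.executor.memoryOverhead=<MB> ...

    # Option 3: recreate the cluster on high-memory workers
    gcloud dataproc clusters create my-highmem-cluster --worker-machine-type n1-highmem-8 ...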