2016-09-29 53 views
4

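Apache Spark fails to move Parquet files from the temporary folder on S3

I have an 8-hour job (Spark 2.0.0) that writes its results out as Parquet using the standard approach: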

processed_images_df.write.format("parquet").save(s3_output_path) 

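It runs 10,000 tasks and writes the results to a _temporary folder. In the last step (after all tasks have completed), it copies the Parquet files out of the _temporary folder, but it fails after copying roughly 2,000-3,000 files. At first I thought this was a transient S3 failure, but I reran the job 3 times and got the same error: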

org.apache.spark.SparkException: Job aborted. 
     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:149) 
     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) 
     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) 
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) 
     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115) 
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) 
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) 
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) 
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) 
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) 
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) 
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) 
     at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487) 
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) 
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:606) 
     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) 
     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
     at py4j.Gateway.invoke(Gateway.java:280) 
     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) 
     at py4j.commands.CallCommand.execute(CallCommand.java:79) 
     at py4j.GatewayConnection.run(GatewayConnection.java:211) 
     at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.http.NoHttpResponseException: s3-bucket.s3.amazonaws.com:443 failed to respond 
     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143) 
     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) 
     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261) 
     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283) 
     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:259) 
     at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:232) 
     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272) 
     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124) 
     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686) 
     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:488) 
     at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884) 
     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) 
     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) 
     at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:326) 
     at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277) 
     at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1143) 
     at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2117) 
     at org.jets3t.service.StorageService.copyObject(StorageService.java:898) 
     at org.jets3t.service.StorageService.copyObject(StorageService.java:943) 
     at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:320) 
     at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:606) 
     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) 
     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) 
     at org.apache.hadoop.fs.s3native.$Proxy20.copy(Unknown Source) 
     at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:645) 
     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:345) 
     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362) 
     at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) 
     at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) 
     at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222) 
     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144) 
     ... 29 more 

Answers

9

The solution I found for this problem was to upgrade Hadoop to 2.7 and set

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 
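For reference, here is one way to apply that setting, either as `spark-submit` arguments or when building the session in PySpark. This is only a minimal sketch: `COMMITTER_CONF`, `spark_submit_args`, and `build_session` are names made up for illustration, and `build_session` assumes pyspark is installed.

```python
# The setting above, expressed as Spark configuration key/value pairs.
COMMITTER_CONF = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def spark_submit_args(conf):
    """Turn a config dict into `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(app_name, conf=COMMITTER_CONF):
    """Create a SparkSession with the settings applied (requires pyspark)."""
    # Imported lazily so the helper above still works without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The setting has to be in place before the session is created, so that it is picked up by the Hadoop configuration used for the write.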

Spark 1.6 had an alternative version of the file output committer that wrote directly to S3, but it was deprecated in Spark 2.0.0: https://issues.apache.org/jira/browse/SPARK-10063
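To illustrate why the algorithm version matters, here is a toy simulation on the local filesystem. It rests on my understanding of the two algorithms: with version 1 each task commit parks its file in a job-level temporary directory and the job commit then renames every file into the destination, while with version 2 each task commit renames its file straight into the destination. The function `run_job` and the directory names `attempts` and `committed` are hypothetical simplifications, not the real Hadoop layout.

```python
import os


def run_job(dest, num_tasks, algorithm_version):
    """Simulate task and job commit; return the renames done at job commit."""
    if algorithm_version not in (1, 2):
        raise ValueError("algorithm_version must be 1 or 2")
    temp_root = os.path.join(dest, "_temporary")
    attempts = os.path.join(temp_root, "attempts")
    committed = os.path.join(temp_root, "committed")
    os.makedirs(attempts)
    os.makedirs(committed)

    for task in range(num_tasks):
        name = f"part-{task:05d}.parquet"
        attempt_file = os.path.join(attempts, name)
        with open(attempt_file, "w") as f:
            f.write("data")
        # Task commit: v1 parks the file under _temporary,
        # v2 renames it directly into the destination.
        target_dir = committed if algorithm_version == 1 else dest
        os.rename(attempt_file, os.path.join(target_dir, name))

    # Job commit: move whatever is still parked under _temporary.
    job_commit_renames = 0
    for name in sorted(os.listdir(committed)):
        os.rename(os.path.join(committed, name), os.path.join(dest, name))
        job_commit_renames += 1

    # Clean up the now-empty temporary directories.
    os.rmdir(attempts)
    os.rmdir(committed)
    os.rmdir(temp_root)
    return job_commit_renames
```

In this model a 10,000-task job leaves 10,000 renames for the final step under version 1 and none under version 2. On S3 each of those renames is a copy, as the `NativeS3FileSystem.rename` and `Jets3tNativeFileSystemStore.copy` frames in the stack trace show, which is where the job in the question was failing.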

+4

Man, you are a lifesaver... – will

+0

I spent half a day looking for the line above! Thanks a million! – FacePalm

+0

More information: https://docs.databricks.com/spark/latest/faq/append-slow-with-spark-2.0.0.html