2016-04-15

Anyone using S3 (Frankfurt) with Hadoop/Spark 1.6.0?

I am trying to store the results of a job on S3, and my dependencies are declared as follows:

"org.apache.spark" %% "spark-core" % "1.6.0" exclude("org.apache.hadoop", "hadoop-client"), 
"org.apache.spark" %% "spark-sql" % "1.6.0", 
"org.apache.hadoop" % "hadoop-client" % "2.7.2", 
"org.apache.hadoop" % "hadoop-aws" % "2.7.2" 

I have set the following configuration:

System.setProperty("com.amazonaws.services.s3.enableV4", "true") 
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

When I call saveAsTextFile on my RDD, it starts up and saves everything to S3. But after a while, when it moves the data from _temporary to the final output, it produces this error:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXXXXXXXXX, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX= 
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798) 
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421) 
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232) 
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528) 
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507) 
at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143) 
at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131) 
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189) 
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134) 
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 

If I use the hadoop-client from the Spark package, it does not even start the transfer. The error occurs randomly: sometimes it works and sometimes it doesn't.

Seems like a problem with your SSH keys. Can you check that you are using the correct keys? – user1314742

The data starts being saved on S3, and the error appears after a while. – flaviotruzzi

@flaviotruzzi did you solve this problem? – pangpang

Answers


Please try setting the following values:

System.setProperty("com.amazonaws.services.s3.enableV4", "true") 
val hadoopConf = sc.hadoopConfiguration 
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 
hadoopConf.set("com.amazonaws.services.s3.enableV4", "true") 
hadoopConf.set("fs.s3a.endpoint", "s3." + region + ".amazonaws.com") 

Set region to the region where your bucket is located; in my case it was eu-central-1.
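For reference, eu-central-1 (Frankfurt) is one of the regions that only accepts Signature Version 4 requests, which is why the region-specific endpoint matters. A minimal sketch of how that endpoint string is built; the set of v4-only regions below is an assumption for illustration, so check the AWS endpoints table for the current list:

```python
def s3_v4_endpoint(region: str) -> str:
    """Build the region-specific S3 endpoint needed for V4-signed requests."""
    return "s3." + region + ".amazonaws.com"

# Regions opened after January 2014 accept only Signature Version 4.
# (Assumed examples, not an exhaustive list.)
V4_ONLY_REGIONS = {"eu-central-1", "ap-northeast-2"}

print(s3_v4_endpoint("eu-central-1"))  # s3.eu-central-1.amazonaws.com
```

With the generic endpoint (s3.amazonaws.com), the AWS SDK falls back to V2 signing against these regions, which produces exactly the SignatureDoesNotMatch 403 seen in the question.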

And add the dependency to your Gradle build (or the equivalent for your build tool):

dependencies { 
    compile 'org.apache.hadoop:hadoop-aws:2.7.2' 
} 

Hope it helps.


In case you are using pyspark, the following worked for me:

import os

aws_profile = "your_profile" 
aws_region = "eu-central-1" 
s3_bucket = "your_bucket" 

# Resolve credentials from the named AWS profile.
# (The original snippet used access_id/access_key without defining them;
# deriving them via boto3 is one way to fill that gap.)
import boto3 
credentials = boto3.Session(profile_name=aws_profile).get_credentials() 
access_id = credentials.access_key 
access_key = credentials.secret_key 

# see https://github.com/jupyter/docker-stacks/issues/127#issuecomment-214594895 
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell" 

# If this doesn't work you might have to delete your ~/.ivy2 directory to reset your package cache. 
# (see https://github.com/databricks/spark-redshift/issues/244#issuecomment-239950148) 
import pyspark 
sc = pyspark.SparkContext() 
# see https://github.com/databricks/spark-redshift/issues/298#issuecomment-271834485 
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true") 

# see https://stackoverflow.com/questions/28844631/how-to-set-hadoop-configuration-values-from-pyspark 
hadoop_conf = sc._jsc.hadoopConfiguration() 
# see https://stackoverflow.com/questions/43454117/how-do-you-use-s3a-with-spark-2-1-0-on-aws-us-east-2 
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true") 
hadoop_conf.set("fs.s3a.access.key", access_id) 
hadoop_conf.set("fs.s3a.secret.key", access_key) 

# see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region 
hadoop_conf.set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com") 

sql = pyspark.sql.SparkSession(sc) 
path = s3_bucket + "/your_file_on_s3" 
dataS3 = sql.read.parquet("s3a://" + path)