Spark write to S3: V4 SignatureDoesNotMatch error

Getting S3 SignatureDoesNotMatch while trying to write a Dataframe to S3 with Spark.
Symptoms / things tried so far:
- The code fails sometimes but works at other times;
- The code can read from S3 without any problem and manages to write to S3 from time to time, which rules out misconfiguration such as wrong S3A / enableV4 settings, wrong keys or the wrong region endpoint (a quick way to double-check these settings at runtime is sketched right after this list);
- The S3A endpoint had been set according to the S3 docs (S3 Endpoint);
- Made sure AWS_SECRETY_KEY does not contain any non-alphanumeric characters, as suggested here;
- Made sure the server time is in sync by using NTP;
- The following was tested on an EC2 m3.xlarge with spark-2.0.2-bin-hadoop2.7 running in local mode;
- The problem goes away when the files are written to the local fs;
- Right now the workaround is to mount the bucket with s3fs and write there; however, that is not ideal, since s3fs dies quite often under the load Spark puts on it;
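As a side note (not from the original post), one way to double-check from a live PySpark session which S3A settings and V4 flag are actually in effect is to read them back through the Hadoop configuration and the driver JVM; the property names below are the standard S3A / AWS SDK ones:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
hadoop_conf = sc._jsc.hadoopConfiguration()

# Endpoint and credentials the S3A connector will actually use
print(hadoop_conf.get("fs.s3a.endpoint"))
print(hadoop_conf.get("fs.s3a.access.key"))

# V4 signing is toggled by a JVM system property on the driver
# (-Dcom.amazonaws.services.s3.enableV4); read it back through py4j
print(sc._jvm.java.lang.System.getProperty("com.amazonaws.services.s3.enableV4"))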
The code can be boiled down to:
spark-submit \
    --verbose \
    --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
    --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --packages org.apache.hadoop:hadoop-aws:2.7.3 \
    --driver-java-options '-Dcom.amazonaws.services.s3.enableV4' \
    foobar.py
# foobar.py
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", 'xxx')
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 's3.dualstack.ap-southeast-2.amazonaws.com')

hc = SparkSession.builder.enableHiveSupport().getOrCreate()
dataframe = hc.read.parquet(in_file_path)
dataframe.write.csv(
    path=out_file_path,
    mode='overwrite',
    compression='gzip',
    sep=',',
    quote='"',
    escape='\\',
    escapeQuotes='true',
)
Spark spits out the following error.
With log4j set to verbose, it appears the following had happened:
- Each individual part is written to a staging location on S3 at /_temporary/foorbar.part-xxx;
- A PUT call then moves each partition to its final location;
- After a few successful PUT calls, all subsequent PUT calls start failing with a 403;
- Since the requests are made by the aws-java-sdk, I'm not sure what can be done at the application level;
- The following log is from another incident with exactly the same error;
>> PUT XXX/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet HTTP/1.1
>> Host: XXX.s3-ap-southeast-2.amazonaws.com
>> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
>> X-Amz-Date: 20161104T005749Z
>> x-amz-metadata-directive: REPLACE
>> Connection: close
>> User-Agent: aws-sdk-java/1.10.11 Linux/3.13.0-100-generic OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91 com.amazonaws.services.s3.transfer.TransferManager/1.10.11
>> x-amz-server-side-encryption-aws-kms-key-id: 5f88a222-715c-4a46-a64c-9323d2d9418c
>> x-amz-server-side-encryption: aws:kms
>> x-amz-copy-source: /XXX/_temporary/0/task_201611040057_0001_m_000025/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet
>> Accept-Ranges: bytes
>> Authorization: AWS4-HMAC-SHA256 Credential=AKIAJZCSOJPB5VX2B6NA/20161104/ap-southeast-2/s3/aws4_request, SignedHeaders=accept-ranges;connection;content-length;content-type;etag;host;last-modified;user-agent;x-amz-content-sha256;x-amz-copy-source;x-amz-date;x-amz-metadata-directive;x-amz-server-side-encryption;x-amz-server-side-encryption-aws-kms-key-id, Signature=48e5fe2f9e771dc07a9c98c7fd98972a99b53bfad3b653151f2fcba67cff2f8d
>> ETag: 31436915380783143f00299ca6c09253
>> Content-Type: application/octet-stream
>> Content-Length: 0
DEBUG wire: << "HTTP/1.1 403 Forbidden[\r][\n]"
DEBUG wire: << "x-amz-request-id: 849F990DDC1F3684[\r][\n]"
DEBUG wire: << "x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=[\r][\n]"
DEBUG wire: << "Content-Type: application/xml[\r][\n]"
DEBUG wire: << "Transfer-Encoding: chunked[\r][\n]"
DEBUG wire: << "Date: Fri, 04 Nov 2016 00:57:48 GMT[\r][\n]"
DEBUG wire: << "Server: AmazonS3[\r][\n]"
DEBUG wire: << "Connection: close[\r][\n]"
DEBUG wire: << "[\r][\n]"
DEBUG DefaultClientConnection: Receiving response: HTTP/1.1 403 Forbidden
<< HTTP/1.1 403 Forbidden
<< x-amz-request-id: 849F990DDC1F3684
<< x-amz-id-2: 6y16TuQeV7CDrXs5s7eHwhrpa1Ymf5zX3IrSuogAqz9N+UN2XdYGL2FCmveqKM2jpGiaek5rUkM=
<< Content-Type: application/xml
<< Transfer-Encoding: chunked
<< Date: Fri, 04 Nov 2016 00:57:48 GMT
<< Server: AmazonS3
<< Connection: close
DEBUG requestId: x-amzn-RequestId: not available
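For context, the failing request above is an S3 COPY: a PUT carrying an x-amz-copy-source header, which is how the output committer "renames" each part file from the _temporary staging path to its final key. A rough boto3 equivalent of that request (this sketch is not from the original post; the XXX bucket name is kept as a placeholder and the exact bucket/key split is approximate) would be:

import boto3

s3 = boto3.client("s3", region_name="ap-southeast-2")

# Server-side copy from the staging key to the final key, mirroring the
# headers seen in the failing request (metadata directive and SSE-KMS)
s3.copy_object(
    Bucket="XXX",
    Key="part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet",
    CopySource={
        "Bucket": "XXX",
        "Key": "_temporary/0/task_201611040057_0001_m_000025/part-r-00025-ae3d5235-932f-4b7d-ae55-b159d1c1343d.gz.parquet",
    },
    MetadataDirective="REPLACE",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="5f88a222-715c-4a46-a64c-9323d2d9418c",
)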
'X-Amz-Date: 20161104T005749Z' is more than a month old. Is this log entry old as well? – Michael - sqlbot
@Michael-sqlbot Yes, we ran into this issue before; back then (early November) we worked around it by reducing the number of partitions (to 11 in this example), and the (aws-java-sdk verbose) log is from that time. The root cause was never identified, and now that the issue has resurfaced I dug those logs out here as an example –
I can confirm that reducing the number of partitions to a low number like 10 with something like df.repartition(10).write.parquet("s3a://" + s3_bucket_out + "/", mode="overwrite", compression="snappy") seems to avoid the problem. – asmaier
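Applying the same idea to the CSV write in foobar.py above would look something like this (a sketch, not from the thread; out_file_path is the same placeholder as before):

# Reduce the number of partitions before writing, as suggested in the comments
dataframe.repartition(10).write.csv(
    path=out_file_path,
    mode='overwrite',
    compression='gzip',
)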