I am looking for a way to read data from multiple partitioned directories on S3 using Python. How do I read partitioned parquet files from S3 using pyarrow in Python?
data_folder/SERIAL_NUMBER=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
data_folder/SERIAL_NUMBER=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet
pyarrow's ParquetDataset module has the ability to read from partitions, so I tried the following code:
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> import s3fs
>>> a = "s3://my_bucker/path/to/data_folder/"
>>> dataset = pq.ParquetDataset(a)
It threw the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
self.metadata_path) = _make_manifest(path_or_paths, self.fs)
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 601, in _make_manifest
.format(path))
OSError: Passed non-file path: s3://my_bucker/path/to/data_folder/
Based on pyarrow's documentation, I tried using s3fs as the file system, i.e.:
>>> dataset = pq.ParquetDataset(a,filesystem=s3fs)
which throws the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
self.metadata_path) = _make_manifest(path_or_paths, self.fs)
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in _make_manifest
if is_string(path_or_paths) and fs.isdir(path_or_paths):
AttributeError: module 's3fs' has no attribute 'isdir'
I am restricted to using an ECS cluster, so spark/pyspark is not an option.
Is there a way to easily read parquet files from such partitioned directories in S3 with Python? I feel that listing all the directories and then reading each one is not good practice, as suggested in this link. I need to convert the data into a pandas DataFrame for further processing, so I would prefer options related to fastparquet or pyarrow. I am open to other options in Python as well.