Reading parquet files from multiple directories in Pyspark

I need to read parquet files from multiple paths that are not parent or child directories of each other.

For example,

dir1 --- 
     | 
     ------- dir1_1 
     | 
     ------- dir1_2 
dir2 --- 
     | 
     ------- dir2_1 
     | 
     ------- dir2_2

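sqlContext.read.parquet(dir1) reads the parquet files from dir1_1 and dir1_2.

Right now I'm reading each directory separately and merging the dataframes with "unionAll". Is there a way to read the parquet files from dir1_2 and dir2_1 without using unionAll, or is there any fancy way of using unionAll?

Thanks

Source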

2016-05-16 joshsuihn

Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. So either of these works:

df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')

or

df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')

Source

2016-05-17 06:37:32

A little late, but I found this while I was searching and it may help someone else...

You might also try unpacking the argument list to spark.read.parquet()

paths=['foo','bar'] 
df=spark.read.parquet(*paths)

If you want to pass a few glob patterns into the path argument:

basePath='s3://bucket/' 
paths=['s3://bucket/partition_value1=*/partition_value2=2017-04-*', 
     's3://bucket/partition_value1=*/partition_value2=2017-05-*' 
     ] 
df=spark.read.option("basePath",basePath).parquet(*paths)

This is cool because you don't need to list all the files under the basePath, and you still get partition inference.
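To illustrate why basePath matters here, the following is a rough pure-Python sketch of the idea behind partition inference: directory segments of the form key=value that sit below the base path become columns. The function name partition_columns and the file name part-0.parquet are made up for illustration, and this is not Spark's actual implementation.

```python
def partition_columns(file_path, base_path):
    """Return the key=value directory segments found below base_path.

    Illustrative only: it mimics the idea of partition discovery,
    not the code Spark runs.
    """
    if not file_path.startswith(base_path):
        raise ValueError("file_path must be located under base_path")
    relative = file_path[len(base_path):].strip("/")
    columns = {}
    # The last segment is the file name, so it is skipped.
    for segment in relative.split("/")[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            columns[key] = value
    return columns


# Hypothetical file under the layout used in the answer above.
example = "s3://bucket/partition_value1=a/partition_value2=2017-04-01/part-0.parquet"

# With the table root as the base, both partition columns are visible.
from_root = partition_columns(example, "s3://bucket/")

# With a deeper base, the segments above it are no longer seen as columns.
from_deeper = partition_columns(example, "s3://bucket/partition_value1=a/")
```

Here from_root holds both partition_value1 and partition_value2, while from_deeper holds only partition_value2, which is the kind of loss that setting basePath to the table root is meant to avoid.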

Source

2017-05-10 00:03:08 N00b

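When I use this code, it searches for the directories under the /home/ directory. Could you post the whole syntax? – Viv

@N00b When I try this code, it gives me an error saying that load takes only 4 arguments, but I have paths to 24 files. Is there an option to override this? I'm trying to avoid doing multiple loads and a union, which is why I want to use load to put multiple files into one df –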
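Regarding the load error in the comment above: the error is consistent with load accepting a single path argument (followed by format, schema and options), so unpacking 24 paths with * would overflow its positional parameters. In PySpark that path argument may itself be a list of strings, so a minimal sketch, assuming an existing SparkSession is passed in as spark and using the hypothetical helper name load_parquet_files, could look like this:

```python
def load_parquet_files(spark, paths):
    """Load many parquet files into one DataFrame with a single load call.

    `spark` is assumed to be an existing SparkSession; the helper name
    is hypothetical.
    """
    # Pass the list itself as the single `path` argument.
    # Do not unpack it with *paths, since load() has only one path parameter.
    return spark.read.load(list(paths), format="parquet")


# Usage, assuming `spark` exists and `my_paths` is a list of 24 file paths:
# df = load_parquet_files(spark, my_paths)
```

The varargs form spark.read.parquet(*paths) shown in the answer remains the shorter option when the format is known to be parquet.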

Just taking John Conley's answer, embellishing it a bit, and providing the full code (used in Jupyter PySpark), as I found his answer extremely useful.

from hdfs import InsecureClient 
client = InsecureClient('http://localhost:50070') 

import posixpath as psp 
fpaths = [ 
    psp.join("hdfs://localhost:9000" + dpath, fname) 
    for dpath, _, fnames in client.walk('/eta/myHdfsPath') 
    for fname in fnames 
] 
# At this point fpaths contains all hdfs files 

parquetFile = sqlContext.read.parquet(*fpaths) 


import pandas 
pdf = parquetFile.toPandas() 
# display the contents nicely formatted. 
pdf
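One possible refinement of the listing step, since a recursive walk may also return non-data files such as job markers or checksum files: keep only names with a given suffix before handing the list to the reader. This is a sketch under the assumption that the data files end in .parquet; the helper name parquet_paths and the sample entries are hypothetical, and the input is any iterable of (directory, subdirectories, file names) tuples like the ones client.walk yields above.

```python
import posixpath as psp


def parquet_paths(walk_entries, prefix="hdfs://localhost:9000", suffix=".parquet"):
    """Build full URIs for the files whose names end with `suffix`."""
    return [
        prefix + psp.join(dpath, fname)
        for dpath, _, fnames in walk_entries
        for fname in fnames
        if fname.endswith(suffix)
    ]


# Hypothetical walk output, shaped like the tuples produced by client.walk.
sample_entries = [
    ("/eta/myHdfsPath", ["a"], ["_SUCCESS"]),
    ("/eta/myHdfsPath/a", [], ["part-0.parquet", "part-1.parquet", ".part-0.parquet.crc"]),
]
sample_paths = parquet_paths(sample_entries)

# With a real client and session this would become:
# fpaths = parquet_paths(client.walk('/eta/myHdfsPath'))
# parquetFile = sqlContext.read.parquet(*fpaths)
```

With the sample entries, only the two part files under /eta/myHdfsPath/a remain in sample_paths.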

Source

2017-10-26 17:57:13 VenVig
