2016-11-15 303 views

I am using Spark 2.0. Using partitionBy on a DataFrameWriter writes a directory layout that contains the column names, not just the values

I have a DataFrame, and my code looks like the following:

df.write.partitionBy("year", "month", "day").format("csv").option("header", "true").save(s"s3://bucket/") 

When the program executes, it writes files in the following format:

s3://bucket/year=2016/month=11/day=15/file.csv 

How can I configure it to produce this format instead:

s3://bucket/2016/11/15/file.csv 

I would also like to know whether the file names can be configured.

Here is the relevant documentation, which seems rather sparse...
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

partitionBy(colNames: String*): DataFrameWriter[T] 
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like: 

year=2016/month=01/ 
year=2016/month=02/ 
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands. 

This was initially applicable for Parquet but in 1.5+ covers JSON, text, ORC and avro as well. 
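To make the Hive-style layout quoted above concrete, here is a minimal Python sketch of how such paths encode column/value pairs. This is illustrative only and not part of Spark; the function names are my own:

```python
# Illustrative sketch of Hive-style partition paths (e.g. year=2016/month=01).
# Assumes string-valued partition columns; function names are hypothetical.

def build_partition_path(partitions):
    """Build a Hive-style path segment from (column, value) pairs."""
    return "/".join(f"{col}={val}" for col, val in partitions)

def parse_partition_path(path):
    """Recover (column, value) pairs from a Hive-style path segment."""
    return [tuple(seg.split("=", 1)) for seg in path.split("/") if "=" in seg]

print(build_partition_path([("year", "2016"), ("month", "01")]))  # year=2016/month=01
print(parse_partition_path("year=2016/month=11/day=15"))
```

Because the column name is embedded in each segment, Spark can rediscover the partitioning scheme (and prune directories against query predicates) from the paths alone, which is why the names appear in the layout.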

回答

0

This is expected and desired behavior. Spark uses the directory structure for partition discovery and pruning, and the correct structure, including the column names, is required for this to work.

You must also remember that partitioning drops the partitioning columns from the data files themselves; their values are encoded only in the directory names.

If you need a specific directory structure, you should rename the directories with a downstream process.
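The rewrite such a downstream process needs is just a string transformation on each key. A minimal Python sketch, assuming keys relative to the bucket (the helper name is hypothetical, and actually moving S3 objects would require an S3 client, omitted here):

```python
import re

def strip_partition_names(key):
    """Rewrite a Hive-style key such as 'year=2016/month=11/day=15/file.csv'
    to the value-only form '2016/11/15/file.csv'."""
    # Drop each 'name=' prefix at the start of a path segment, keeping the value.
    return re.sub(r"(^|/)[^/=]+=", r"\1", key)

print(strip_partition_names("year=2016/month=11/day=15/file.csv"))  # 2016/11/15/file.csv
```

Note that once the column names are stripped, Spark can no longer infer the partition columns from the paths when reading the data back.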

0

You can use the following script to rename the directories:

#!/usr/bin/env bash

# Rename partition folders: strip "COLUMN=", e.g. DATE=20170708 becomes 20170708.
# Usage: rename_partitions.sh <path> <column>

path=$1
col=$2
for f in $(hdfs dfs -ls "$path" | awk '{print $NF}' | grep "$col="); do
    a="$(echo "$f" | sed "s/$col=//")"
    hdfs dfs -mv "$f" "$a"
done