
I have a partitioned structure in S3, set up for explicit partition pruning, that causes the following error when I read.parquet() in SparkR. Is there a basePath data-source option?

Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths 
    s3a://leftout/for/security/dashboard/updateddate=20170217 
    s3a://leftout/for/security/dashboard/updateddate=20170218 

Further down, the (lengthy) error tells me:

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. 

However, I cannot find any documentation on how to do this with SparkR::read.parquet(...). Does anyone know how to do this in R (with SparkR)?
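For illustration, here is a minimal local stand-in for the layout involved (the directory name and dates are hypothetical, mirroring the redacted S3 paths above). Passing the leaf partition directories to the reader, rather than the table root, is what makes Spark raise the "Conflicting directory structures" assertion.

```shell
# Hypothetical local stand-in for the S3 layout: a table root with
# Hive-style date partitions underneath it.
mkdir -p /tmp/dashboard/updateddate=20170217
mkdir -p /tmp/dashboard/updateddate=20170218

# These leaf directories are what gets passed to read.parquet(),
# which is why Spark asks for a "basePath" pointing at the root.
ls -d /tmp/dashboard/updateddate=*
```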

> version 

platform  x86_64-redhat-linux-gnu  
arch   x86_64      
os    linux-gnu     
system   x86_64, linux-gnu   
status          
major   3       
minor   2.2       
year   2015       
month   08       
day   14       
svn rev  69053      
language  R       
version.string R version 3.2.2 (2015-08-14) 
nickname  Fire Safety  

> sessionInfo() 
R version 3.2.2 (2015-08-14) 
Platform: x86_64-redhat-linux-gnu (64-bit) 
Running under: Amazon Linux AMI 2016.09 

locale: 
[1] LC_CTYPE=en_US.UTF-8  LC_NUMERIC=C    LC_TIME=en_US.UTF-8  LC_COLLATE=en_US.UTF-8  
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8  LC_NAME=C     
[9] LC_ADDRESS=C    LC_TELEPHONE=C    LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C  

attached base packages: 
[1] stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] lubridate_1.6.0 SparkR_2.0.2  DT_0.2   jsonlite_1.2  shinythemes_1.1.1 ggthemes_3.3.0 
[7] dplyr_0.5.0  ggplot2_2.2.1  leaflet_1.0.1  shiny_1.0.0  

loaded via a namespace (and not attached): 
[1] Rcpp_0.12.9  magrittr_1.5  munsell_0.4.3  colorspace_1.3-2 xtable_1.8-2  R6_2.2.0   
[7] stringr_1.1.0  plyr_1.8.4  tools_3.2.2  grid_3.2.2  gtable_0.2.0  DBI_0.5-1   
[13] sourcetools_0.1.5 htmltools_0.3.5 yaml_2.1.14  lazyeval_0.2.0 digest_0.6.12  assertthat_0.1 
[19] tibble_1.2  htmlwidgets_0.8 mime_0.5   stringi_1.1.2  scales_0.4.1  httpuv_1.3.3    

Answers


In Spark 2.1 or later you can pass basePath as a named argument; it is captured by the ellipsis (...):

read.parquet(path, basePath="s3a://leftout/for/security/dashboard/") 

The arguments are converted with varargsToStrEnv and used as options.
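The mechanism can be sketched in plain R. Here capture_options is a hypothetical stand-in, not part of SparkR; it just shows how named arguments caught by `...` end up coerced to strings and forwarded as reader options, which is roughly what varargsToStrEnv does internally.

```r
# Hypothetical stand-in for how SparkR collects `...` into options.
capture_options <- function(path, ...) {
  opts <- list(...)
  # every option value is forwarded to the JVM side as a string
  lapply(opts, as.character)
}

capture_options("/tmp/data", basePath = "/tmp/data", mergeSchema = "true")
```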

A complete session, for example:

  • Write the data (Scala):

    Seq(("a", 1), ("b", 2)).toDF("k", "v") 
        .write.partitionBy("k").mode("overwrite").parquet("/tmp/data") 
    
  • Read the data (SparkR):

    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/ '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
          /_/
    
    
    SparkSession available as 'spark'. 
    
    > paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE) 
    > read.parquet(paths, basePath="/tmp/data") 
    
    SparkDataFrame[v:int, k:string] 
    

    By contrast, without basePath:

    > read.parquet(paths) 
    
    SparkDataFrame[v:int] 
    

This is the closest I have come. From the source code:

read.parquet.default <- function(path, ...) { 
    sparkSession <- getSparkSession() 
    options <- varargsToStrEnv(...) 
    # Allow the user to have a more flexible definiton of the Parquet file path 
    paths <- as.list(suppressWarnings(normalizePath(path))) 
    read <- callJMethod(sparkSession, "read") 
    read <- callJMethod(read, "options", options) 
    sdf <- handledCallJMethod(read, "parquet", paths) 
    dataFrame(sdf) 
} 

This approach is also shown here, yet it still throws an unused argument error:

read.parquet(..., options=c(basePath="foo"))
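One plausible reading of that error: the read.parquet signature exposed at this Spark version declares neither an options formal nor `...`, so R's argument matching rejects the extra named argument outright. A plain-R sketch of that failure mode (f is illustrative, not SparkR's actual generic):

```r
# Hypothetical sketch: a function whose signature declares neither
# `options` nor `...` rejects any unknown named argument at call time.
f <- function(path) path

msg <- tryCatch(
  f("/tmp/data", options = c(basePath = "foo")),
  error = function(e) conditionMessage(e)
)
msg  # "unused argument (options = ...)"
```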