
I have a partitioned structure in S3, set up for explicit partition pruning, that causes the following error when I read.parquet() in SparkR. Is there a basePath data-source option?

Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths 
    s3a://leftout/for/security/dashboard/updateddate=20170217 
    s3a://leftout/for/security/dashboard/updateddate=20170218 

Further down, the (lengthy) error tells me:

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. 

However, I cannot find any documentation on how to do this with SparkR::read.parquet(...). Does anyone know how to do this in R (with SparkR)?
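For illustration, here is a minimal local stand-in for the layout involved (the directory name and dates are hypothetical, mirroring the redacted S3 paths above). Passing the leaf partition directories to the reader, rather than the table root, is what makes Spark raise the "Conflicting directory structures" assertion.

```shell
# Hypothetical local stand-in for the S3 layout: a table root with
# Hive-style date partitions underneath it.
mkdir -p /tmp/dashboard/updateddate=20170217
mkdir -p /tmp/dashboard/updateddate=20170218

# These leaf directories are what gets passed to read.parquet(),
# which is why Spark asks for a "basePath" pointing at the root.
ls -d /tmp/dashboard/updateddate=*
```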

> version 

platform  x86_64-redhat-linux-gnu  
arch   x86_64      
os    linux-gnu     
system   x86_64, linux-gnu   
status          
major   3       
minor   2.2       
year   2015       
month   08       
day   14       
svn rev  69053      
language  R       
version.string R version 3.2.2 (2015-08-14) 
nickname  Fire Safety  

> sessionInfo() 
R version 3.2.2 (2015-08-14) 
Platform: x86_64-redhat-linux-gnu (64-bit) 
Running under: Amazon Linux AMI 2016.09 

locale: 
[1] LC_CTYPE=en_US.UTF-8  LC_NUMERIC=C    LC_TIME=en_US.UTF-8  LC_COLLATE=en_US.UTF-8  
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8  LC_NAME=C     
[9] LC_ADDRESS=C    LC_TELEPHONE=C    LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C  

attached base packages: 
[1] stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] lubridate_1.6.0 SparkR_2.0.2  DT_0.2   jsonlite_1.2  shinythemes_1.1.1 ggthemes_3.3.0 
[7] dplyr_0.5.0  ggplot2_2.2.1  leaflet_1.0.1  shiny_1.0.0  

loaded via a namespace (and not attached): 
[1] Rcpp_0.12.9  magrittr_1.5  munsell_0.4.3  colorspace_1.3-2 xtable_1.8-2  R6_2.2.0   
[7] stringr_1.1.0  plyr_1.8.4  tools_3.2.2  grid_3.2.2  gtable_0.2.0  DBI_0.5-1   
[13] sourcetools_0.1.5 htmltools_0.3.5 yaml_2.1.14  lazyeval_0.2.0 digest_0.6.12  assertthat_0.1 
[19] tibble_1.2  htmlwidgets_0.8 mime_0.5   stringi_1.1.2  scales_0.4.1  httpuv_1.3.3    

Answers


In Spark 2.1 or later you can pass basePath as a named argument; it is captured by the ellipsis (...):

read.parquet(path, basePath="s3a://leftout/for/security/dashboard/") 

The arguments are converted with varargsToStrEnv and used as options.
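The mechanism can be sketched in plain R. Here capture_options is a hypothetical stand-in, not part of SparkR; it just shows how named arguments caught by `...` end up coerced to strings and forwarded as reader options, which is roughly what varargsToStrEnv does internally.

```r
# Hypothetical stand-in for how SparkR collects `...` into options.
capture_options <- function(path, ...) {
  opts <- list(...)
  # every option value is forwarded to the JVM side as a string
  lapply(opts, as.character)
}

capture_options("/tmp/data", basePath = "/tmp/data", mergeSchema = "true")
```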

A complete session, for example:

  • Write the data (Scala):

    Seq(("a", 1), ("b", 2)).toDF("k", "v") 
        .write.partitionBy("k").mode("overwrite").parquet("/tmp/data") 
    
  • Read the data (SparkR):

    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/ '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
          /_/
    
    
    SparkSession available as 'spark'. 
    
    > paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE) 
    > read.parquet(paths, basePath="/tmp/data") 
    
    SparkDataFrame[v:int, k:string] 
    

    By contrast, without basePath:

    > read.parquet(paths) 
    
    SparkDataFrame[v:int] 
    

This is the closest I have come. From the source code:

read.parquet.default <- function(path, ...) { 
    sparkSession <- getSparkSession() 
    options <- varargsToStrEnv(...) 
    # Allow the user to have a more flexible definiton of the Parquet file path 
    paths <- as.list(suppressWarnings(normalizePath(path))) 
    read <- callJMethod(sparkSession, "read") 
    read <- callJMethod(read, "options", options) 
    sdf <- handledCallJMethod(read, "parquet", paths) 
    dataFrame(sdf) 
} 

This approach is also shown here, yet it still throws an unused argument error:

read.parquet(..., options=c(basePath="foo"))
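One plausible reading of that error: the read.parquet signature exposed at this Spark version declares neither an options formal nor `...`, so R's argument matching rejects the extra named argument outright. A plain-R sketch of that failure mode (f is illustrative, not SparkR's actual generic):

```r
# Hypothetical sketch: a function whose signature declares neither
# `options` nor `...` rejects any unknown named argument at call time.
f <- function(path) path

msg <- tryCatch(
  f("/tmp/data", options = c(basePath = "foo")),
  error = function(e) conditionMessage(e)
)
msg  # "unused argument (options = ...)"
```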