從CSV文件創建Spark數據集

我想從簡單的CSV文件創建Spark數據集。下面是CSV文件的內容：從CSV文件創建Spark數據集

name,state,number_of_people,coolness_index 
trenton,nj,"10","4.5" 
bedford,ny,"20","3.3" 
patterson,nj,"30","2.2" 
camden,nj,"40","8.8"

這裏是使數據集的代碼：

var location = "s3a://path_to_csv" 

case class City(name: String, state: String, number_of_people: Long) 

val cities = spark.read 
    .option("header", "true") 
    .option("charset", "UTF8") 
    .option("delimiter",",") 
    .csv(location) 
    .as[City]

以下是錯誤消息：「不能達到投number_of_people從字符串到BIGINT，因爲它可能截斷「

Databricks談論如何在this blog post中創建數據集和此特定錯誤消息。

編碼器急切地檢查你的數據預期的架構，匹配提供錯誤信息幫助你試圖將數據的錯誤過程的TB之前。例如，如果我們嘗試使用太小的數據類型，例如轉換爲對象會導致截斷（即numStudents大於一個字節，其中最大值爲255），分析器將發出一個數據類型爲 AnalysisException。

我使用的是Long類型，所以我沒想到會看到這個錯誤信息。

來源

2016-09-16 Powers

使用架構推斷：

val cities = spark.read 
    .option("inferSchema", "true") 
    ...

或提供模式：

val cities = spark.read 
    .schema(StructType(Array(StructField("name", StringType), ...)

或投：

val cities = spark.read 
    .option("header", "true") 
    .csv(location) 
    .withColumn("number_of_people", col("number_of_people").cast(LongType)) 
    .as[City]

來源

2016-09-16 01:05:45

與案例類城市（名稱：字符串，狀態：字符串， number_of_people：長），你只需要一條線

private val cityEncoder = Seq(City("", "", 0)).toDS

，那麼你的代碼

val cities = spark.read 
.option("header", "true") 
.option("charset", "UTF8") 
.option("delimiter",",") 
.csv(location) 
.as[City]

將只是工作。

這是，得自這個網站的官方消息[http://spark.apache.org/docs/latest/sql-programming-guide.html#overview][1]

來源

2017-07-27 10:28:40

從CSV文件創建Spark數據集

回答

相關問題