How to load a csv directly into a Spark Dataset?

I have a csv file [1] that I want to load directly into a Dataset. The problem is that I always get errors like

org.apache.spark.sql.AnalysisException: Cannot up cast `probability` from string to float as it may truncate 
The type path of the target object is: 
- field (class: "scala.Float", name: "probability") 
- root class: "TFPredictionFormat" 
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;

Moreover, for the phrases field specifically (see the case class [2]), I get

org.apache.spark.sql.AnalysisException: cannot resolve '`phrases`' due to data type mismatch: cannot cast StringType to ArrayType(StringType,true);

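If I define all the fields of my case class [2] as type String, then everything works fine, but this is not what I want. Is there a simple way to do it [3]?

References

[1] An example row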

B017NX63A2,Merrell,"['merrell_for_men', 'merrell_mens_shoes', 'merrel']",merrell_shoes,0.0806054356579781

[2] My code snippet is as follows

import spark.implicits._ 

val INPUT_TF = "<SOME_URI>/my_file.csv" 

final case class TFFormat (
    doc_id: String, 
    brand: String, 
    phrases: Seq[String], 
    prediction: String, 
    probability: Float 
) 

val ds = sqlContext.read 
.option("header", "true") 
.option("charset", "UTF8") 
.csv(INPUT_TF) 
.as[TFFormat] 

ds.take(1).map(println)

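[3] I have found ways to do it by first defining the columns at the DataFrame level and then converting to a Dataset (like here or here or here), but I am almost sure this is not the way it is supposed to be done. I am also pretty sure that Encoders are probably the answer, but I have no clue how to use them.

Source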

2017-03-08 Vassilis Moustakas

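TL;DR With csv input, transforming the data with standard DataFrame operations is the way to go. If you want to avoid that, you should use an input format with expressive types (Parquet or even JSON).

In general, data to be converted to a statically typed Dataset must already be of the correct type. The most efficient way to do that is to provide a schema argument to the csv reader: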

val schema: StructType = ??? 
val ds = spark.read 
    .option("header", "true") 
    .schema(schema) 
    .csv(path) 
    .as[T]

where schema can be inferred by reflection:

import org.apache.spark.sql.catalyst.ScalaReflection 
import org.apache.spark.sql.types.StructType 

val schema = ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]
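
The schema can also be written out by hand instead of derived by reflection. The following is a minimal sketch, not a definitive implementation: `FlatRow`, `FlatSchema` and `read` are hypothetical names, the record deliberately contains only atomic column types, and it assumes Spark 2.2 or later, where `DataFrameReader.csv` also accepts a `Dataset[String]` (used here so the example needs no input file).

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.{FloatType, StringType, StructField, StructType}

// Hypothetical flat record: atomic column types only (no phrases array).
final case class FlatRow(doc_id: String, brand: String, prediction: String, probability: Float)

object FlatSchema {
  // Explicit schema; field names and order must match the csv columns.
  val schema: StructType = StructType(Seq(
    StructField("doc_id", StringType, nullable = true),
    StructField("brand", StringType, nullable = true),
    StructField("prediction", StringType, nullable = true),
    StructField("probability", FloatType, nullable = true)
  ))

  // Parse csv lines with the schema above and convert them to a typed Dataset.
  def read(spark: SparkSession, lines: Dataset[String]): Dataset[FlatRow] = {
    import spark.implicits._
    spark.read.schema(schema).csv(lines).as[FlatRow]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-schema-sketch")
      .getOrCreate()
    import spark.implicits._

    // One row modelled on the example in the question, minus the phrases column.
    val lines = Seq("B017NX63A2,Merrell,merrell_shoes,0.0806054356579781").toDS()
    read(spark, lines).show(truncate = false)
    spark.stop()
  }
}
```

Because `probability` is already parsed as a float by the reader, `.as[FlatRow]` needs no up-cast from string.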

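Unfortunately this won't work with your data and class, because the csv reader doesn't support ArrayType (though it would work for atomic types like FloatType), so you have to do it the hard way. A naive solution could be expressed as follows: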

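import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

val df: DataFrame = ??? // Raw data, every column read as a string

df
    .withColumn("probability", $"probability".cast("float"))
    // Strip the brackets and single quotes, then split on commas
    // (",\\s*" also drops the space that follows each comma)
    .withColumn("phrases",
        split(regexp_replace($"phrases", "[\\['\\]]", ""), ",\\s*"))
    .as[TFFormat]

But you may need something more sophisticated, depending on the content of phrases.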
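
As one possible refinement, the phrases string can be parsed by extracting the quoted items rather than stripping characters, so that commas inside an item survive. This is only a sketch under stated assumptions: `PhraseParser` and `parsePhrases` are hypothetical names, and the input is assumed to look like the example row, i.e. a list of single-quoted strings without escaped quotes.

```scala
object PhraseParser {
  // Matches one single-quoted item such as 'merrell_for_men'.
  // Assumption: items never contain an escaped single quote.
  private val Quoted = "'([^']*)'".r

  // Returns the quoted items in order; null or an empty list yields no items.
  def parsePhrases(raw: String): Seq[String] =
    if (raw == null) Seq.empty
    else Quoted.findAllMatchIn(raw).map(_.group(1)).toList
}
```

Since this is plain Scala, it could be wrapped with `udf` from `org.apache.spark.sql.functions` and applied to the phrases column with `withColumn` in place of the `split`/`regexp_replace` expression.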

Source

2017-03-08 18:25:17 user6910411

Thanks! Just to add another angle: one can also use Encoders to infer the schema: `Encoders.product[TFFormat].schema`
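
The suggestion in this comment can be sketched as follows, reusing the case class from the question; the wrapper object `EncoderSchema` is a hypothetical name added for illustration.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Case class from the question.
final case class TFFormat(
  doc_id: String,
  brand: String,
  phrases: Seq[String],
  prediction: String,
  probability: Float
)

object EncoderSchema {
  // Derive the schema from the case class through its product encoder.
  val schema: StructType = Encoders.product[TFFormat].schema

  def main(args: Array[String]): Unit =
    schema.printTreeString()
}
```

The derived schema still contains an array column for phrases, so the limitation of the csv reader described in the answer applies to it as well.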
