How can I load a CSV directly into a Spark Dataset?

I have a CSV file [1] that I want to load directly into a Dataset. The problem is that I keep getting errors like this:
org.apache.spark.sql.AnalysisException: Cannot up cast `probability` from string to float as it may truncate
The type path of the target object is:
- field (class: "scala.Float", name: "probability")
- root class: "TFPredictionFormat"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
Moreover, specifically for the phrases field (see my case class [2]), I get:
org.apache.spark.sql.AnalysisException: cannot resolve '`phrases`' due to data type mismatch: cannot cast StringType to ArrayType(StringType,true);
If I define all the fields in my case class [2] as type String, then everything works, but this is not what I want. Is there a simple way to do this [3]?
References
[1] A sample row:
B017NX63A2,Merrell,"['merrell_for_men', 'merrell_mens_shoes', 'merrel']",merrell_shoes,0.0806054356579781
[2] My code snippet looks like this:
import spark.implicits._

val INPUT_TF = "<SOME_URI>/my_file.csv"

final case class TFFormat (
  doc_id: String,
  brand: String,
  phrases: Seq[String],
  prediction: String,
  probability: Float
)

val ds = sqlContext.read
  .option("header", "true")
  .option("charset", "UTF8")
  .csv(INPUT_TF)
  .as[TFFormat]

ds.take(1).map(println)
[3] I have found ways of doing this by first defining the columns at the DataFrame level and then converting to a Dataset (like here or here or here), but I am almost sure this is not how things are supposed to be done. I am also pretty sure that Encoders are probably the answer, but I have no clue how.
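For reference, the DataFrame-level approach mentioned in [3] can be sketched like this (a sketch, not tested against this exact file; the regular expression used to parse the phrases column assumes the bracketed, single-quoted format shown in the sample row [1]):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Read every column as a string first, then cast/parse each column
// explicitly before converting to the typed Dataset.
val ds = sqlContext.read
  .option("header", "true")
  .option("charset", "UTF8")
  .csv(INPUT_TF)
  // Explicit cast avoids the "Cannot up cast ... as it may truncate" error.
  .withColumn("probability", $"probability".cast("float"))
  // "['a', 'b']" -> Seq("a", "b"): strip brackets, quotes and spaces,
  // then split on commas. Assumes the values themselves contain none
  // of these characters.
  .withColumn("phrases",
    split(regexp_replace($"phrases", "[\\[\\]' ]", ""), ","))
  .as[TFFormat]
```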
Thanks! Just to add one angle: you can also use Encoders to infer the schema: 'Encoders.product[TFFormat].schema'
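Following the comment above, a minimal sketch of deriving the schema from the case class rather than writing it by hand (note: passing this schema to the reader works for the flat columns, but CSV has no array representation, so the phrases column would still need manual parsing):

```scala
import org.apache.spark.sql.Encoders

// Derive a StructType schema from the case class.
val schema = Encoders.product[TFFormat].schema

// Inspect the inferred field names and types.
schema.printTreeString()
```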