
In Spark I have the DataFrame and schema below, and a Spark SQL cast expression creates a column full of NULLs.

val df = spark.read.options(Map("header"-> "true")).csv("path") 

scala> df.show() 

+-------+-------+-----+ 
| user| topic| hits| 
+-------+-------+-----+ 
|  om| scala| 120| 
| daniel| spark| 80| 
|3754978| spark| 1| 
+-------+-------+-----+ 

scala> df.printSchema 

root 
|-- user: string (nullable = true) 
|-- topic: string (nullable = true) 
|-- hits: string (nullable = true) 

I want to change the hits column to an integer.

I tried this:

scala> df.createOrReplaceTempView("test") 
scala> val dfNew = spark.sql("select *, cast('hist' as integer) as hist2 from test") 

scala> dfNew.printSchema 

root 
|-- user: string (nullable = true) 
|-- topic: string (nullable = true) 
|-- hits: string (nullable = true) 
|-- hist2: integer (nullable = true) 

But when I display the DataFrame, the hist2 column is filled with nulls:

scala> dfNew.show() 

+-------+-------+-----+-----+ 
| user| topic| hits|hist2| 
+-------+-------+-----+-----+ 
|  om| scala| 120| null| 
| daniel| spark| 80| null| 
|3754978| spark| 1| null| 
+-------+-------+-----+-----+ 

I also tried this:

scala> val df2 = df.withColumn("hitsTmp", 
         df.hits.cast(IntegerType)).drop("hits") 
         .withColumnRenamed("hitsTmp", "hits") 

and got this:

<console>:26: error: value hits is not a member of org.apache.spark.sql.DataFrame 

I also tried this:

scala> val df2 = df.selectExpr("user", "topic", "cast(hits as int) hits") 

and got this: 
org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user, topic, hits]; line 1 pos 0; 
'Project [user#0, 'topic, cast('hits as int) AS hits#22] 
+- Relation[user#0, topic#1, hits#2] csv 

scala> val df2 = df.selectExpr("cast(hits as int) hits") 

and I got a similar error.

Any help would be appreciated. I know this question has been answered before, but I tried three different approaches (posted here) and none of them worked.

Thanks.


I am using Spark version 2.1.0.

Answer


You can cast the column to an integer type in any of the following ways:

df.withColumn("hits", df("hits").cast("integer"))
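
To verify that the cast took effect, a quick check along these lines should do (a sketch; df is the DataFrame read from the CSV in the question):

// After the cast, hits should be reported as integer instead of string.
df.withColumn("hits", df("hits").cast("integer")).printSchema()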

Or, converting the column with an explicit IntegerType (note the required import):

import org.apache.spark.sql.types.IntegerType

data.withColumn("hitsTmp", data("hits").cast(IntegerType)) 
  .drop("hits") 
  .withColumnRenamed("hitsTmp", "hits") 

Or:

data.selectExpr("user", "topic", "cast(hits as int) hits") 
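
As a side note on the Spark SQL attempt in the question: cast('hist' as integer) casts the string literal 'hist' (which is also not the column name), and casting a non-numeric string to an integer yields null, which is why hist2 came out entirely null. Here is a sketch of the same query with the column referenced by name (df and the test view are the ones from the question; hits2 is just an illustrative alias):

df.createOrReplaceTempView("test")
// Unquoted hits refers to the column, so the cast operates on the actual values.
val dfNew = spark.sql("select *, cast(hits as int) as hits2 from test")
dfNew.printSchema()   // hits2: integer (nullable = true)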

I tried all of them without success.


scala> val df2 = df.withColumn("hits", df("hits").cast("integer")) 
org.apache.spark.sql.AnalysisException: Cannot resolve column name "hits" among (user, topic, hits); 
  at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219) 
  at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219) 
  at scala.Option.getOrElse(Option.scala:121) 
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218) 
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1073) 
  at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059) 
  ... 48 elided


scala> val df2 = df.selectExpr("user", "topic", "cast(hits as int) hits") 
org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user, topic, hits]; line 1 pos 0; 
'Project [user#0, 'topic, cast('hits as int) AS hits#40] 
+- Relation[user#0, topic#1, hits#2] csv