How do I do complex calculations between columns in a Spark DataFrame?

For example, I have this DataFrame:

val calresult1 = indexedresult.withColumn("_4", lit(1)) 
calresult1.show() 
+---+---+------------------+---+
| _1| _2|                _3| _4|
+---+---+------------------+---+
|  5|  2|               5.0|  1|
|  5|  0|0.5555555555555554|  1|
|  4|  0| 3.222222222222222|  1|
|  3|  5|               1.0|  1|
......

I can do simple calculations with +, -, *, and /:

val calresult2 = calresult1.withColumn("_5", calresult1.col("_4")/(calresult1.col("_3"))).select("_1","_2","_5") 
calresult2.show() 
+---+---+------------------+
| _1| _2|                _5|
+---+---+------------------+
|  5|  2|               0.2|
|  5|  0|1.8000000000000007|
|  4|  0|               1.0|
......

But I can't use pow or sqrt:

val calresult2 = calresult1.withColumn("_5", pow(calresult1.col("_4")+(calresult1.col("_3")))).select("_1","_2","_5") 
calresult2.show()

The error:

Error:(414, 53) could not find implicit value for parameter impl: breeze.numerics.pow.Impl[org.apache.spark.sql.Column,VR] 
val calresult2 = calresult1.withColumn("_5", pow(calresult1.col("_4")+(calresult1.col("_3")))).select("_1","_2","_5") 
               ^

How can I implement complex formulas?

2017-07-17 Pi Pi

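pow() takes two arguments: a base and an exponent, where the exponent can be a Column or a Double. I believe you left out the second argument. The error message also shows the call resolving to breeze.numerics.pow rather than Spark's pow, so make sure org.apache.spark.sql.functions.pow is the one in scope: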

pow(calresult1.col("_4")+(calresult1.col("_3")))

Supplying the second argument, as in the example below, fixes the problem:

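import org.apache.spark.sql.functions._

// Raise (_4 + _3) to the power 2.0; the exponent is the second argument.
val calresult2 = calresult1.withColumn(
    "_5", pow(calresult1.col("_4") + calresult1.col("_3"), 2.0)
).select(
    "_1", "_2", "_5"
)

// Call show() separately so calresult2 stays a DataFrame rather than Unit.
calresult2.show()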

+---+---+------------------+
| _1| _2|                _5|
+---+---+------------------+
|  5|  2|              36.0|
|  5|  0|2.4197530864197523|
|  4|  0|17.827160493827154|
|  3|  5|               4.0|
+---+---+------------------+
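
The question also asks about square roots. As a minimal sketch, assuming a local SparkSession and a small hypothetical DataFrame (named sample here) built from the four rows shown in the question, sqrt and pow from org.apache.spark.sql.functions can be nested to build a compound formula. The column names root and formula are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, pow, sqrt}

// Local session for experimenting; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("column-math").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for calresult1, using the rows shown in the question.
val sample = Seq(
  (5, 2, 5.0),
  (5, 0, 0.5555555555555554),
  (4, 0, 3.222222222222222),
  (3, 5, 1.0)
).toDF("_1", "_2", "_3").withColumn("_4", lit(1))

// sqrt takes one Column; pow takes a base and an exponent.
val withRoots = sample
  .withColumn("root", sqrt(col("_4") + col("_3")))
  .withColumn("formula", sqrt(pow(col("_3"), 2.0) + pow(col("_4"), 2.0)))

withRoots.show()
```

Because every step returns a Column, the expressions compose like ordinary arithmetic.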

2017-07-17 01:23:57

How do I control the precision? For example, how do I turn 2.4197530864197523 into 2.41975?

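Just wrap the expression in round() and pass the number of decimal places as its second argument. Using the previous example, round(pow(calresult1.col("_4") + calresult1.col("_3"), 2.0), 5) gives the desired precision.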
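
A minimal sketch of that rounding step, assuming a local SparkSession and a hypothetical two-row stand-in for calresult1 (named roundSample here):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, pow, round}

// Local session for experimenting; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("round-demo").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for calresult1, using two rows from the question.
val roundSample = Seq((5, 2, 5.0), (5, 0, 0.5555555555555554))
  .toDF("_1", "_2", "_3")
  .withColumn("_4", lit(1))

// round(column, scale) keeps `scale` digits after the decimal point.
val rounded = roundSample
  .withColumn("_5", round(pow(col("_4") + col("_3"), 2.0), 5))
  .select("_1", "_2", "_5")

rounded.show()
```

Rounding here changes the stored value, not just how it is displayed, so it is usually best applied as the last step of a calculation.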

Just use the built-in functions:

import org.apache.spark.sql.functions.{pow, sqrt}

and you'll be fine.

In general you can use user-defined functions (UserDefinedFunction), but they aren't needed here.
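
For comparison, here is a rough sketch of what the user-defined-function route could look like, assuming a local SparkSession and a hypothetical two-row stand-in for calresult1. The names udfSample and squaredSum are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

// Local session for experimenting; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for calresult1: _3 is a Double, _4 is the Int literal 1.
val udfSample = Seq((5, 2, 5.0), (3, 5, 1.0))
  .toDF("_1", "_2", "_3")
  .withColumn("_4", lit(1))

// Hypothetical UDF: computes (a + b)^2 in plain Scala instead of Column functions.
val squaredSum = udf((a: Int, b: Double) => math.pow(a + b, 2.0))

val viaUdf = udfSample
  .withColumn("_5", squaredSum(col("_4"), col("_3")))
  .select("_1", "_2", "_5")

viaUdf.show()
```

A UDF is opaque to Spark's optimizer, which is one reason to prefer built-ins such as pow and sqrt when they can express the formula.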

2017-07-17 01:10:16 user8317003
