How to use Option in a Spark UDF

I have a dataset like this:

+----+------+
|code|status|
+----+------+
|   1| "new"|
|   2|  null|
|   3|  null|
+----+------+

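I want to write a UDF that depends on both columns.

I got it working by following the second approach in this answer, which is to handle null outside the UDF and write myFn to take a Boolean as its second parameter: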

df.withColumn("new_column",
    when(df("status").isNull,
        myFnUdf($"code", lit(false))
    )
    .otherwise(
        myFnUdf($"code", lit(true))
    )
)

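The other approach I looked at, for handling null inside the UDF, is this answer, which talks about "wrapping arguments with Options". I tried code like this: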

df.withColumn("new_column", myFnUdf($"code", $"status")) 

def myFn(code: Int, status: String) = (code, Option(status)) match { 
    case (1, "new") => "1_with_new_status" 
    case (2, Some(_)) => "2_with_any_status" 
    case (3, None) => "3_no_status" 
}

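But for a row with null this gives type mismatch; found :None.type required String. I also tried wrapping the argument in Option when creating the udf, without success. The basic form of that (without Option) is: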

val myFnUdf = udf[String, Int, String](myFn(_:Int, _:String))

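I'm new to Scala, so I'm sure I'm missing something simple. Part of my confusion may come from the different syntaxes for creating UDFs from functions (e.g. per https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html), so I'm not sure I'm using the best approach. Any help is appreciated!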

Edit

Edited to add the missing (1, "new") case, per the comments from @user6910411 and @sgvd.

Source

2016-12-15 Derek Hill

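First, there may be some code you are using that is missing here. When I take your example myFn, turn it into a UDF with val myFnUdf = udf(myFn _), and run it with df.withColumn("new_column", myFnUdf($"code", $"status")).show, I don't get a type mismatch. I get a MatchError instead, as user6910411 also pointed out. That is because no pattern matches (1, "new").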
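The MatchError can be reproduced in plain Scala, without Spark. The sketch below uses a hypothetical function partialFn that keeps only the cases for codes 2 and 3, with the patterns written against Option. It is an illustration, not the asker's exact code.

```scala
import scala.util.Try

// Hypothetical stand-in for myFn with no case covering (1, "new").
def partialFn(code: Int, status: String): String =
  (code, Option(status)) match {
    case (2, Some(_)) => "2_with_any_status"
    case (3, None)    => "3_no_status"
  }

// An input covered by a case returns normally.
val covered = partialFn(3, null)

// An input with no matching case throws scala.MatchError at runtime;
// Try captures the exception so it can be inspected.
val uncovered = Try(partialFn(1, "new"))
```

Inside a UDF the same exception would surface only when the job runs, wrapped in a Spark task failure, which is why it can look different from a compile-time type mismatch.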

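Beyond that, although it is usually better to use Scala's Option than raw null values, you don't have to in this case. The following example works with null directly: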

val my_udf = udf((code: Int, status: String) => status match { 
    case null => "no status" 
    case _ => "with status" 
}) 

df.withColumn("new_column", my_udf($"code", $"status")).show

Result:

+----+------+-----------+
|code|status| new_column|
+----+------+-----------+
|   1|   new|with status|
|   2|  null|  no status|
|   3|  null|  no status|
+----+------+-----------+
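For anyone who wants to reproduce this from scratch, here is a minimal self-contained sketch. The local SparkSession settings, the app name, the name statusUdf, and the way the sample DataFrame is built are all assumptions, since the question does not show how df was created.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Assumed local session; in spark-shell, getOrCreate returns the existing one.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("udf-null-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data mirroring the question; None is stored as a null status.
val df = Seq[(Int, Option[String])](
  (1, Some("new")),
  (2, None),
  (3, None)
).toDF("code", "status")

// A null status reaches the function as a plain null String,
// so Option(status) turns it into None.
val statusUdf = udf((code: Int, status: String) => Option(status) match {
  case None    => "no status"
  case Some(_) => "with status"
})

val labels = df
  .withColumn("new_column", statusUdf($"code", $"status"))
  .orderBy("code")
  .select("new_column")
  .as[String]
  .collect()
  .toSeq
```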

It still works after wrapping with Option, though:

val my_udf = udf((code: Int, status: String) => Option(status) match { 
    case None => "no status" 
    case Some(_) => "with status" 
})

This gives the same result.
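Applied to the two-column function from the question, the same idea could look like the sketch below. Because the value being matched is an (Int, Option[String]) pair, each status pattern has to be Some(...) or None rather than a bare string. The final fallback case and its "unknown" label are assumptions added here so that unexpected rows do not raise a MatchError.

```scala
// Plain Scala; wrap it with udf(myFn _) to apply it to DataFrame columns.
def myFn(code: Int, status: String): String =
  (code, Option(status)) match {
    case (1, Some("new")) => "1_with_new_status"
    case (2, Some(_))     => "2_with_any_status"
    case (3, None)        => "3_no_status"
    case _                => "unknown" // hypothetical fallback, not in the original
  }
```

Keeping the function's parameter as String and calling Option(status) inside the body is what lets Spark pass the null through as an ordinary value.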

Source

2016-12-15 10:21:11 sgvd

Thanks @sgvd. I used both approaches (and updated the question to include the missing case). Thanks for your help.
