Scala regex UDF to grab query parameter values and convert them to a comma-separated list

I have data similar to the following:

one=1&two=22222&three=&four=4f4

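As you can see, the value for the variable three is missing. I want to use a Scala regex to extract all the values and return them comma-separated.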

Desired output:

1,22222,,4f4

Another possible output, which would be even better:

1,22222,undefined,4f4

Here is my current code (I am using Spark 2.0 Scala DataFrames):

def main(args: Array[String]) { 
    ... 
    val pattern : scala.util.matching.Regex = """[^&?]*?=([^&?]*)""".r 
    df.select(transform(pattern)($"data").alias("csvData")).take(100).foreach(println) 
} 

def transform(pattern: scala.util.matching.Regex) = udf(
(dataMapping: String) => pattern.findAllIn(dataMapping).toList 
)

Which returns:

[WrappedArray(one=1, two=22222, three=, four=4f4)] 
[WrappedArray(...)]

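I think I could do better in my "transform" UDF, but I am very new to Scala and not sure how to match the first group and return a comma-separated result. I think my solution will use something like m => m.group(1), but I am not sure. Thanks for any suggestions.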

Asked 2016-11-14 by satoukum

If you have multiple columns, you are probably best off using a UDF:

scala> val df = Seq(("one=1&two=22222&three=&four=4f4", 1)).toDF("a", "b") 
df: org.apache.spark.sql.DataFrame = [a: string, b: int] 

scala> df.show 
+--------------------+---+
|                   a|  b|
+--------------------+---+
|one=1&two=22222&t...|  1|
+--------------------+---+

scala> val p = """(?:one|two|three|four)=(.+)""".r
p: scala.util.matching.Regex = (?:one|two|three|four)=(.+)

scala> :pa 
// Entering paste mode (ctrl-D to finish) 

val regexUDF = udf((x: String) =>
    x.split("&").map(p.findFirstMatchIn(_).map(_.group(1)).getOrElse(null))
)

// Exiting paste mode, now interpreting. 

regexUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(StringType))) 

scala> val df2 = df.withColumn("a", regexUDF($"a")) 
df2: org.apache.spark.sql.DataFrame = [a: array<string>, b: int] 

scala> df2.show 
+--------------------+---+
|                   a|  b|
+--------------------+---+
|[1, 22222, null, ...|  1|
+--------------------+---+


scala> df2.collect.foreach{println} 
[WrappedArray(1, 22222, null, 4f4),1]
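
If you want the comma-separated string from the question rather than an array, one option is to keep the asker's original pattern, pull out group 1 of every match, and join the results with mkString. The following is only a minimal sketch in plain Scala (no Spark needed to try it). The object name QueryValues, the function name toCsv, and the placeholder parameter are made up for illustration, and it assumes the input string is not null.

```scala
object QueryValues {
  // Pattern from the question: group 1 captures the text after '='
  // up to the next '&' or '?' (possibly empty).
  private val pattern = """[^&?]*?=([^&?]*)""".r

  // Hypothetical helper: joins all parameter values with commas.
  // Empty values are replaced by `placeholder` (default: keep them empty).
  // Assumes `query` is non-null.
  def toCsv(query: String, placeholder: String = ""): String =
    pattern.findAllMatchIn(query)
      .map(_.group(1))                              // the m => m.group(1) idea from the question
      .map(v => if (v.isEmpty) placeholder else v)  // fill in missing values
      .mkString(",")
}
```

With the sample row, QueryValues.toCsv("one=1&two=22222&three=&four=4f4") should give 1,22222,,4f4, and passing "undefined" as the second argument should give 1,22222,undefined,4f4. To use it on a DataFrame column you could wrap it the same way as above, for example udf((s: String) => QueryValues.toCsv(s, "undefined")), and apply it to $"data".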

Answered 2016-11-14 23:00:09

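Is there a way to get just 1, 22222, ... in the value column, without including the parameter name and the = sign? – satoukum

Also, if the DataFrame has many columns, how do I specify that I want to split the column called data? – satoukum

@satoukum see my edit –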
