2017-08-10 76 views
3

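Spark column string replace when present in other column (row)

I would like to remove from col1 any string that is present in col2 of the same row: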

val df = spark.createDataFrame(Seq(
("Hi I heard about Spark", "Spark"), 
("I wish Java could use case classes", "Java"), 
("Logistic regression models are neat", "models") 
)).toDF("sentence", "label") 

Using regexp_replace or translate (ref: Spark functions API):

val res = df.withColumn("sentence_without_label", regexp_replace 
(col("sentence") , "(?????)", "")) 

so that res contains each sentence with its label removed. (The original post showed the expected output as an image.)

Answers

3

You can simply use

df5.withColumn("sentence_without_label", regexp_replace($"sentence" , lit($"label"), lit(""))) 

or you can use a simple UDF as follows:

val df5 = spark.createDataFrame(Seq(
    ("Hi I heard about Spark", "Spark"), 
    ("I wish Java could use case classes", "Java"), 
    ("Logistic regression models are neat", "models") 
)).toDF("sentence", "label") 

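import org.apache.spark.sql.functions.udf

// Note: String.replaceAll interprets rep as a regular expression, not as plain text
val replace = udf((data: String, rep: String) => data.replaceAll(rep, ""))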

val res = df5.withColumn("sentence_without_label", replace($"sentence" , $"label")) 

res.show() 

Output:

+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
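One caveat with the UDF above: replaceAll treats the label as a regular expression, so a label containing characters such as parentheses or dots may not be removed literally. The following is a minimal plain-Scala sketch (no Spark required) of two ways to force a literal match. The object name LiteralReplaceSketch, the helper names removeLiteral and removeQuoted, and the sample label "(Java)" are hypothetical and do not come from the original post.

```scala
import java.util.regex.Pattern

object LiteralReplaceSketch {
  // Hypothetical helper: String.replace matches `label` as plain text, never as a regex.
  def removeLiteral(sentence: String, label: String): String =
    if (sentence == null || label == null) sentence
    else sentence.replace(label, "")

  // Hypothetical helper: keep replaceAll, but quote the label so that
  // regex metacharacters in it are matched literally.
  def removeQuoted(sentence: String, label: String): String =
    if (sentence == null || label == null) sentence
    else sentence.replaceAll(Pattern.quote(label), "")

  def main(args: Array[String]): Unit = {
    val sentence = "I wish (Java) could use case classes"
    val label = "(Java)"

    // Unquoted: "(Java)" is a capture group matching "Java", so the parentheses stay behind.
    println(sentence.replaceAll(label, ""))

    // Both helpers remove the whole label, parentheses included.
    println(removeLiteral(sentence, label))
    println(removeQuoted(sentence, label))
  }
}
```

Assuming the labels are meant as plain text, the body of either helper could be used inside the UDF in place of the bare replaceAll call.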
+2

There is no need for a UDF here. – mtoto

5

If label is just a literal string, this is quite simple:

import org.apache.spark.sql.functions._ 

df.withColumn("sentence_without_label", 
    regexp_replace(col("sentence"), col("label"), lit(""))).show(false) 

+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+

In Spark 1.6 you can do the same with expr:

df.withColumn(
    "sentence_without_label", 
    expr("regexp_replace(sentence, label, '')"))