Removing leading special characters from rows in a DataFrame

I have a dataset containing special characters, as shown below:

! Hello World. 1 
" Hi there. 0

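What I want to do is remove all special characters from the beginning of each row (only the leading ones, not the special characters in the rest of the text).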

To read the data (tab-separated), I use the following code:

val data = sparkSession.read.format("com.databricks.spark.csv") 
    .option("delimiter", "\t") 
    .load("data.txt") 

val columns = Seq("text", "class") 
val df = data.toDF(columns: _*)

I know I should use replaceAll(), but I am not sure how to do it.

Source

2017-03-09 Giorgos Myrianthous

You can create a udf and apply it to the first column of your DataFrame to remove the leading special characters:

val df = Seq(("! Hello World.", 1), ("\" Hi there.", 0)).toDF("text", "class") 

df.show
+--------------+-----+
|          text|class|
+--------------+-----+
|! Hello World.|    1|
|   " Hi there.|    0|
+--------------+-----+


import org.apache.spark.sql.functions.udf 
// remove leading non-word characters from a string 
def remove_leading: String => String = _.replaceAll("^\\W+", "")  
val udf_remove = udf(remove_leading) 

df.withColumn("text", udf_remove($"text")).show
+------------+-----+
|        text|class|
+------------+-----+
|Hello World.|    1|
|   Hi there.|    0|
+------------+-----+
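
As a possible alternative to the udf above, Spark's built-in regexp_replace function accepts the same pattern, so the column can be cleaned without registering a udf. The following is only a minimal sketch: it assumes a local SparkSession, and the object name StripLeading and the helper stripLeading are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

object StripLeading {
  // Hypothetical helper: removes leading non-word characters from the "text" column.
  // "^\\W+" is the same pattern used by the udf in the answer above.
  def stripLeading(df: DataFrame): DataFrame =
    df.withColumn("text", regexp_replace(col("text"), "^\\W+", ""))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("strip-leading")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("! Hello World.", 1), ("\" Hi there.", 0)).toDF("text", "class")
    stripLeading(df).show()

    spark.stop()
  }
}
```

Note that in Java regular expressions \W matches anything outside [a-zA-Z0-9_] by default, so a leading non-ASCII letter would also be stripped. If that matters for your data, the pattern would need adjusting.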

Source

2017-03-09 13:48:08 Psidom

Maybe this will help:

val str = " some string " 
str.trim

Or trim some specific characters:

str.stripPrefix(",").stripSuffix(",").trim

Or remove certain characters from the front:

val ignorable = ", \t\r\n"
str.dropWhile(c => ignorable.indexOf(c) >= 0)

All of the useful string operations can be found in the Scala StringOps documentation.
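
Applied to the rows from the question, the dropWhile approach might look like the sketch below. It is plain Scala with no Spark involved, and the names DropLeading and dropLeadingSpecial are hypothetical. It also assumes that "special character" means anything that is not a letter or a digit, which is an interpretation rather than something the question states.

```scala
object DropLeading {
  // Hypothetical helper: drops characters from the front of the string
  // until the first letter or digit is reached; the rest is left untouched.
  def dropLeadingSpecial(s: String): String =
    s.dropWhile(c => !c.isLetterOrDigit)

  def main(args: Array[String]): Unit = {
    val rows = Seq("! Hello World.", "\" Hi there.")
    // Prints "Hello World." and "Hi there." on separate lines.
    rows.map(dropLeadingSpecial).foreach(println)
  }
}
```

Unlike the regex \W, isLetterOrDigit treats a leading underscore as a special character and keeps non-ASCII letters, so the two approaches can differ on such input.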

Source

2017-03-09 13:42:28 FaigB
