Masking fields of a column (struct type) in a Spark DataFrame

I created a DataFrame from an XML file. The resulting DataFrame has the following schema.

val df = hiveContext.read.format("com.databricks.spark.xml").option("rowTag", row_tag_name).load(data_dir_path_xml) 

df.printSchema() 

      root 
      |-- samples: struct (nullable = true) 
      | |-- sample: array (nullable = true) 
      | | |-- element: struct (containsNull = true) 
      | | | |-- abc: string (nullable = true) 
      | | | |-- def: long (nullable = true) 
      | | | |-- type: string (nullable = true) 
      |-- abc: string (nullable = true)

I want to mask abc/def in the DataFrame.

I can get the field I want using:

val abc = df.select($"samples.sample".getField("abc"))

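However, I want to mask the abc/def fields inside the DataFrame df itself (for example, replace the abc field with XXXX). Please help me solve this.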


2017-05-11 Raj

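What do you mean by masking abc/def? Do you want to mask abc with the value of def? –

I want to replace the fields 'abc' and 'def' with the value 'xxxxx'. These fields contain sensitive data. – Raj

You want to replace the values of those columns, right? –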

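The databricks xml library doesn't seem to offer much support for manipulating the contents of an XML-based DataFrame (if it did, wouldn't it be cool to be able to use XSLT?!). But you can always manipulate the inferred rows directly, for example: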

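import org.apache.spark.sql.Row

// df.rdd.map avoids the Row encoder that Dataset.map requires in Spark 2.x+
val abc = df.rdd.map { row =>
  // Column 0 is the samples struct; its field 0 is the sample array of structs
  val samples = row.getStruct(0).getSeq[Row](0)
  val maskedSamples = samples.map { sample =>
    // Mask abc, keep def and type (getLong fails if def is null)
    Row("xxxxx", sample.getLong(1), sample.getString(2))
  }
  // Rebuild the row: samples struct first, then the top-level abc column
  Row(Row(maskedSamples), row.getString(1))
}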

The code above may not match the transformation you need exactly, since the requirement is a little unclear, but you get the idea.
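As a rough sketch of how the mapped rows could be turned back into a DataFrame with the original schema, the helper below maps over `df.rdd` and passes `df.schema` to `createDataFrame`. The names `MaskByRows` and `maskSamples` are hypothetical. It assumes Spark 2.0 or later (for `df.sparkSession`), the schema printed in the question, and that `0L` is an acceptable placeholder for `def`, since a `long` field cannot hold `"xxxxx"` without changing the schema.

```scala
import org.apache.spark.sql.{DataFrame, Row}

object MaskByRows {
  // Hypothetical helper; assumes the question's schema:
  // samples: struct<sample: array<struct<abc, def, type>>>, abc: string
  def maskSamples(df: DataFrame): DataFrame = {
    val maskedRows = df.rdd.map { row =>
      val samples = row.getAs[Row]("samples")
      val maskedSamples: Row =
        if (samples == null) null
        else {
          val idx = samples.fieldIndex("sample")
          val maskedElems: Seq[Row] =
            if (samples.isNullAt(idx)) null
            else samples.getSeq[Row](idx).map { e =>
              // Array elements may be null (containsNull = true)
              if (e == null) null
              else Row("xxxxx", 0L, e.getAs[String]("type"))
            }
          Row(maskedElems)
        }
      // Same column order as the original schema: samples, then abc
      Row(maskedSamples, row.getAs[String]("abc"))
    }
    // Reusing df.schema keeps field names, types and nullability unchanged
    df.sparkSession.createDataFrame(maskedRows, df.schema)
  }
}
```

Calling `MaskByRows.maskSamples(df).printSchema()` should then print the same tree as the original `df`. The top-level `abc` column is left untouched here; mask it in the final `Row(...)` if it is sensitive as well.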


2017-05-11 21:23:37 halversonp

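I suggest you split the samples array of structs into separate columns (StructFields), so that you can mask or replace them as you need. You can also apply other DataFrame functions later if you wish.
Here it is split into three columns:

// Each new column is an array holding that field from every element of samples.sample
df.withColumn("abcd", $"samples.sample.abc")
     .withColumn("def", $"samples.sample.def")
     .withColumn("type", $"samples.sample.type")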

You can drop the samples column if you want:

.drop("samples")

Since you want to mask abc and def with XXXX, you can do:

df.withColumn("abcd", lit("XXXX"))
     .withColumn("def", lit("XXXX"))
     .withColumn("type", $"samples.sample.type")
     .drop("samples")

Note: the column is named abcd, with a trailing d, because the schema already has another column named abc.

Edited in response to @Raj's comment below:

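If the original schema has to be preserved, separate columns are not needed; a case class and a udf should do the trick.

First define the case class Raj. The backticks let the field names match the original schema, since def and type are reserved words in Scala:

case class Raj(abc: String,
               `def`: Option[Long],
               `type`: String)

Then define the udf, which builds one masked Raj for every element of the type array:

import org.apache.spark.sql.functions.{col, struct, udf}

val mask = udf((types: Seq[String]) =>
  if (types == null) null else types.map(t => Raj("XXXXX", Option(0L), t)))

Finally, call the udf on the type field inside withColumn:

df.withColumn("samples", struct(mask(col("samples.sample.type")) as "sample"))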

This should give you the output you want.
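On newer Spark versions the same result can probably be reached without a udf. The sketch below is not part of the original answer: it assumes Spark 3.0 or later, where `org.apache.spark.sql.functions.transform` accepts a Scala lambda, and the schema from the question. The names `MaskWithTransform` and `maskSamples` are hypothetical, and `0L` is an assumed placeholder for `def`, which has to stay a `long` if the schema is to be kept.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, struct, transform}

object MaskWithTransform {
  // Hypothetical helper; assumes Spark 3.0+ and the question's schema.
  def maskSamples(df: DataFrame): DataFrame =
    df.withColumn(
      "samples", // same name as the existing column, so it is replaced in place
      struct(
        transform(col("samples.sample"), s =>
          struct(
            lit("xxxxx").as("abc"),        // mask the string field
            lit(0L).as("def"),             // placeholder: def must remain a long
            s.getField("type").as("type")  // keep type unchanged
          )
        ).as("sample")
      )
    )
}
```

The field names and types match the original schema, but the nullable flags may differ, because `lit` and `struct` produce non-nullable values.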


2017-05-12 17:12:16

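I tried using df.withColumn. Doing so creates a DataFrame whose schema is root |-- abcd: string (nullable = true) |-- def: long (nullable = true) |-- type: string (nullable = true) |-- abc: string (nullable = true), but I want the same schema as the original DF. – Raj

@Raj, I have updated the post for your expected output. –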
