2017-07-26 113 views

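Flattening a DataFrame with different data types in Scala

As you know, a DataFrame can contain fields of complex types, such as structs (StructType) or arrays (ArrayType). You may need, as I did, to map all the DataFrame data to a Hive table whose fields have simple types (String, Integer, ...). I struggled with this problem for a long time and finally found a solution that I want to share. I am sure it can be improved, so feel free to reply with your own suggestions.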

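It is based on this thread, but it also works for ArrayType elements, not only for StructType elements. It is a tail-recursive function that takes a DataFrame and returns a flattened one.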

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenDf(df: DataFrame): DataFrame = {
    val fields = df.schema.fields
    val fieldNames = df.schema.fieldNames
    var i = 0

    // Look for the first complex column, flatten it by one level,
    // then recurse on the resulting DataFrame.
    while (i < fields.length) {
        val field = fields(i)
        val fieldName = field.name

        field.dataType match {
            case st: StructType =>
                // Replace the struct column with its child fields.
                val childFieldNames = st.fieldNames.map(n => fieldName + "." + n)
                val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
                val newDf = df.selectExpr(newFieldNames: _*)
                return flattenDf(newDf)
            case _: ArrayType =>
                // Produce one row per array element, then expand the element.
                // Note: "a.*" assumes the array elements are structs.
                val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
                val fieldNamesAndExplode = fieldNamesExcludingArray ++ Array(s"explode($fieldName) as a")
                val fieldNamesToSelect = fieldNamesExcludingArray ++ Array("a.*")
                val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
                val explodedAndSelectedDf = explodedDf.selectExpr(fieldNamesToSelect: _*)
                return flattenDf(explodedAndSelectedDf)
            case _ => () // simple type, nothing to flatten
        }

        i += 1
    }
    // No struct or array column is left.
    df
}
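
The array branch selects "a.*", which only works when the array elements are structs; Spark cannot star-expand an array of plain values such as array<string>. The following is a minimal sketch of one possible variant that also covers that case. It is not the code above: the names FlattenSketch, flattenStep, flatten, Address, Item and Order, as well as the sample data, are hypothetical, and the sketch assumes a local SparkSession and column names without dots or special characters.

```scala
import scala.annotation.tailrec
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

// Hypothetical sample types, used only to build a small nested DataFrame.
case class Address(city: String, zip: String)
case class Item(sku: String, qty: Int)
case class Order(id: Long, address: Address, tags: Seq[String], items: Seq[Item])

object FlattenSketch {

  // Flattens the first complex column by one level, or returns None
  // when no struct or array column is left.
  def flattenStep(df: DataFrame): Option[DataFrame] = {
    val names = df.schema.fieldNames
    df.schema.fields.collectFirst {
      // Struct column: replace it with its child fields.
      case StructField(name, st: StructType, _, _) =>
        val others = names.filter(_ != name)
        df.selectExpr(others ++ st.fieldNames.map(c => s"$name.$c"): _*)
      // Array of structs: one row per element, then expand the struct.
      case StructField(name, ArrayType(_: StructType, _), _, _) =>
        val others = names.filter(_ != name)
        df.selectExpr(others :+ s"explode($name) as a": _*)
          .selectExpr(others :+ "a.*": _*)
      // Array of anything else: one row per element, same column name.
      case StructField(name, _: ArrayType, _, _) =>
        df.selectExpr(names.map(n => if (n == name) s"explode($n) as $n" else n): _*)
    }
  }

  // Repeats flattenStep until nothing is left to flatten.
  @tailrec
  def flatten(df: DataFrame): DataFrame = flattenStep(df) match {
    case Some(next) => flatten(next)
    case None       => df
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("flatten-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Order(1L, Address("Oslo", "0150"), Seq("t1", "t2"), Seq(Item("s1", 1), Item("s2", 2)))
    ).toDF()

    val flat = flatten(df)
    flat.printSchema() // columns: id, tags, city, zip, sku, qty
    flat.show()
    spark.stop()
  }
}
```

Two caveats apply to this sketch and to the original function alike: explode drops rows whose array is null or empty, and child fields are promoted without a prefix, so two nested fields with the same name end up as duplicate columns.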

For starters, `val fieldNames = df.schema.fieldNames` :D – philantrovert

Does this work with array types? It does not work for me. :( –

Answer

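In reply to Ramesh Maharjan (I have to post this as an answer because of the message length): it works for me, and I have tried it with several examples. Print the DataFrame schema at each iteration to see what is happening.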
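
One way to get such a trace is to call printSchema() at the start of every recursive call. The sketch below is a compact restatement of the same flattening idea with that call added. The names TracedFlatten, flattenDfTraced, Leaf, Node and Root and the sample data are hypothetical, and, like the original, the array branch assumes struct elements.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical sample types: a struct that contains an array of structs.
case class Leaf(v: String)
case class Node(leaves: Seq[Leaf])
case class Root(k: String, node: Node)

object TracedFlatten {

  // Prints the schema, flattens the first struct or array column by one
  // level, and recurses until no such column is left.
  def flattenDfTraced(df: DataFrame, iteration: Int = 0): DataFrame = {
    println(s"--- schema at iteration $iteration ---")
    df.printSchema()

    val names = df.schema.fieldNames
    val firstComplex = df.schema.fields.find { f =>
      f.dataType.isInstanceOf[StructType] || f.dataType.isInstanceOf[ArrayType]
    }

    firstComplex match {
      case None => df // only simple columns remain
      case Some(f) =>
        val others = names.filter(_ != f.name)
        val next = f.dataType match {
          case st: StructType =>
            // Replace the struct column with its child fields.
            df.selectExpr(others ++ st.fieldNames.map(c => s"${f.name}.$c"): _*)
          case _ =>
            // Array column: one row per element, then expand the element.
            // "a.*" assumes the elements are structs.
            df.selectExpr(others :+ s"explode(${f.name}) as a": _*)
              .selectExpr(others :+ "a.*": _*)
        }
        flattenDfTraced(next, iteration + 1)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("traced-flatten")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Root("k1", Node(Seq(Leaf("v1"), Leaf("v2"))))).toDF()
    // Prints three schemas: the input, after the struct step, after the array step.
    flattenDfTraced(df).show()
    spark.stop()
  }
}
```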

One of my examples:

root 
|-- A: string (nullable = true) 
|-- B: long (nullable = true) 
|-- C: string (nullable = true) 
|-- Ds: array (nullable = true) 
| |-- D: struct (nullable = true) 
| | |-- Es: array (nullable = true) 
| | | |-- E: string (nullable = true) 
| | |-- Fs: array (nullable = true) 
| | | |-- F: string (nullable = true) 
|-- G: string (nullable = true) 
|-- H: string (nullable = true) 
|-- I: string (nullable = true) 
|-- J: string (nullable = true) 
|-- K: string (nullable = true) 

root 
|-- A: string (nullable = true) 
|-- B: long (nullable = true) 
|-- C: string (nullable = true) 
|-- G: string (nullable = true) 
|-- H: string (nullable = true) 
|-- I: string (nullable = true) 
|-- J: string (nullable = true) 
|-- K: string (nullable = true) 
|-- D: struct (nullable = true) 
| |-- Es: array (nullable = true) 
| | |-- E: string (nullable = true) 
| |-- Fs: array (nullable = true) 
| | |-- F: string (nullable = true) 

root 
|-- A: string (nullable = true) 
|-- B: long (nullable = true) 
|-- C: string (nullable = true) 
|-- G: string (nullable = true) 
|-- H: string (nullable = true) 
|-- I: string (nullable = true) 
|-- J: string (nullable = true) 
|-- K: string (nullable = true) 
|-- Es: array (nullable = true) 
| |-- E: string (nullable = true) 
|-- Fs: array (nullable = true) 
| |-- F: string (nullable = true) 

root 
|-- A: string (nullable = true) 
|-- B: long (nullable = true) 
|-- C: string (nullable = true) 
|-- G: string (nullable = true) 
|-- H: string (nullable = true) 
|-- I: string (nullable = true) 
|-- J: string (nullable = true) 
|-- K: string (nullable = true) 
|-- Fs: array (nullable = true) 
| |-- F: string (nullable = true) 
|-- E: string (nullable = true) 

root 
|-- A: string (nullable = true) 
|-- B: long (nullable = true) 
|-- C: string (nullable = true) 
|-- G: string (nullable = true) 
|-- H: string (nullable = true) 
|-- I: string (nullable = true) 
|-- J: string (nullable = true) 
|-- K: string (nullable = true) 
|-- E: string (nullable = true) 
|-- F: string (nullable = true)