2017-07-06 124 views
0

I have a Spark DataFrame with a String-type column "Age", and I want to add a new column that holds, as a string, the age range that each value falls into. How can I add a column to a DataFrame based on the value of another column in Spark?

The bounds for the ranges of the new column are as follows:

[-1,12,17,24,34,44,54,64,100,1000]

Example input values:

Age
=====
-1
12
18
28
38
46
=====

Desired output:

Age   Age-Range
===== =========
-1    (-1, 12)
12    (-1, 12)
18    (17, 24)
28    (24, 34)
38    (34, 44)
46    (44, 54)
===== =========

Any suggestion or help is highly appreciated.

Answers

2

Here is a quick suggestion, I hope it helps:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// A range contains its lower bound and excludes its upper bound
case class AgeRange(lowerBound: Int, upperBound: Int) {
    def contains(value: Int): Boolean = value >= lowerBound && value < upperBound
}

val rangeList = List(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)
// Pair consecutive bounds into ranges: (-1,12), (12,17), (17,24), ...
val ranges = rangeList.sliding(2).map(list => AgeRange(list(0), list(1))).toList
val dataset = Seq("-1", "12", "18", "28", "38", "46").toDS

def findRange(value: Int, ageRanges: List[AgeRange]): Option[AgeRange] = ageRanges.find(_.contains(value))

// With UDF: the Option[AgeRange] becomes a nullable struct column
def myUdf(ageRanges: List[AgeRange]) = udf {
    i: Int => findRange(i, ageRanges)
}

val result1 = dataset.toDF("age").withColumn("age_range", myUdf(ranges)(col("age").cast("int")))

// With map: pair each value with its range on the Dataset, then name the columns
val result2 = dataset.map {
    i: String => (i, findRange(i.toInt, ranges))
}.toDF("age", "age_range")

This results in:

result1: org.apache.spark.sql.DataFrame = [age: string, age_range: struct<lowerBound: int, upperBound: int>] 
result2: org.apache.spark.sql.DataFrame = [age: string, age_range: struct<lowerBound: int, upperBound: int>] 
+---+---------+
|age|age_range|
+---+---------+
| -1|  [-1,12]|
| 12|  [12,17]|
| 18|  [17,24]|
| 28|  [24,34]|
| 38|  [34,44]|
| 46|  [44,54]|
+---+---------+
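
Note that age_range comes back as a struct column, while the question asked for a plain string. A minimal string-returning variant, reusing findRange and ranges from above (the exact "(lower-upper)" layout and the myStringUdf name are just assumptions about the wanted format):

// Hypothetical variant of myUdf that renders the match as "(lower-upper)" text
def myStringUdf(ageRanges: List[AgeRange]) = udf {
    i: Int =>
        findRange(i, ageRanges)
            .map(r => s"(${r.lowerBound}-${r.upperBound})")
            .getOrElse("")   // empty string when the age falls outside every range
}

val result3 = dataset.toDF("age")
    .withColumn("age_range", myStringUdf(ranges)(col("age").cast("int")))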
+0

Thank you so much Daniel!!! ... it worked for me!!! ... – Bhavesh

1

You can use a udf function as

import org.apache.spark.sql.functions.udf

// Take the closest bound at or below the age and the closest bound above it
def range = udf((age: String) => {
    val array = Array(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)
    val ageInt = age.toInt
    array.filter(_ <= ageInt).last.toString + "-" + array.filter(_ > ageInt).head.toString
})

and call it on your dataframe as

df.withColumn("Age-Range", range($"Age")) 

and you should get the output

+---+---------+
|Age|Age-Range|
+---+---------+
|-1 |-1-12    |
|12 |12-17    |
|18 |17-24    |
|28 |24-34    |
|38 |34-44    |
|46 |44-54    |
+---+---------+

The final output is not exactly what you need, but it should give you more than enough of an idea to reach the exact solution.
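
For the bracketed style shown in the question, a small variant of the same udf could work (just a sketch; rangeFormatted is a made-up name, and the hard-coded array is kept as above):

// Hypothetical variant: same lookup, but wrapped in the "(lower - upper)" format
def rangeFormatted = udf((age: String) => {
    val array = Array(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000)
    val ageInt = age.toInt
    val lower = array.filter(_ <= ageInt).last
    val upper = array.filter(_ > ageInt).head
    s"($lower - $upper)"   // like the original, this assumes the age lies inside the overall range
})

df.withColumn("Age-Range", rangeFormatted($"Age"))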

2

Here is a simple solution using a UDF, but you need to create the list of ranges manually.

import org.apache.spark.sql.functions.udf
import spark.implicits._

// dataframe with column Age
val df = spark.sparkContext.parallelize(Seq("-1", "12", "18", "28", "38", "38", "388", "3", "41")).toDF("Age")

val updateUDF = udf((age: String) => {
    // (lower bound inclusive, upper bound exclusive, label)
    val range = Seq(
        (-1, 12, "(-1 - 12)"),
        (12, 17, "(12 - 17)"),
        (17, 24, "(17 - 24)"),
        (24, 34, "(24 - 34)"),
        (34, 44, "(34 - 44)"),
        (44, 54, "(44 - 54)"),
        (54, 64, "(54 - 64)"),
        (64, 100, "(64 - 100)"),
        (100, 1000, "(100 - 1000)")
    )
    // keep the label of the first range the age falls into
    range.map(value => {
        if (age.toInt >= value._1 && age.toInt < value._2) value._3
        else ""
    }).filter(_.nonEmpty).head
})

df.withColumn("Age-Range", updateUDF($"Age")).show(false)

Here is the output: 
+---+------------+
|Age|Age-Range   |
+---+------------+
|-1 |(-1 - 12)   |
|12 |(12 - 17)   |
|18 |(17 - 24)   |
|28 |(24 - 34)   |
|38 |(34 - 44)   |
|38 |(34 - 44)   |
|388|(100 - 1000)|
|3  |(-1 - 12)   |
|41 |(34 - 44)   |
+---+------------+
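
As a side note, and not part of the answer above: if you would rather not maintain the bounds by hand inside the UDF, Spark ML ships a Bucketizer that bins a numeric column against the same split points; a rough sketch of the idea (the AgeNum and AgeBucket column names are made up here):

import org.apache.spark.ml.feature.Bucketizer

// Bucketizer assigns each value the index of its [lower, upper) split interval
val bucketizer = new Bucketizer()
    .setInputCol("AgeNum")
    .setOutputCol("AgeBucket")
    .setSplits(Array(-1, 12, 17, 24, 34, 44, 54, 64, 100, 1000).map(_.toDouble))

// Bucketizer works on numeric columns, so cast the string Age first
val bucketed = bucketizer.transform(df.withColumn("AgeNum", $"Age".cast("double")))

The resulting AgeBucket index can then be turned into a label string with one more small lookup.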

I hope this helps!

+0

Thank you very much!!! ... – Bhavesh