2017-05-30 84 views
2

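Convert Scala FP-growth RDD output to a DataFrame

https://spark.apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth

sample_fpgrowth.txt can be found here: https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt

I ran the FP-growth example from the link above in Scala and it works fine, but what I need is how to convert the results, which are RDDs, into DataFrames. Both of these are RDDs: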

model.freqItemsets and 
model.generateAssociationRules(minConfidence) 

Please explain in detail, using the example given in my question.

+1

Possible duplicate of [How to convert rdd object to dataframe in spark](https://stackoverflow.com/questions/29383578/how-to-convert-rdd-object-to-dataframe-in-spark) – stefanobaghino

+0

I tried that and got errors, probably because I am new to Scala. Could you explain in detail using the example given in my question?

+0

@zero323 could you help me with my question?

Answer

2

Once you have an rdd, there are many ways to create a dataframe. One of them is the .toDF function, which requires sqlContext.implicits to be imported:

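import org.apache.spark.sql.SparkSession

// Local session; importing the implicits brings .toDF into scope for RDDs
val sparkSession = SparkSession.builder().appName("udf testings")
    .master("local")
    .config("", "")   // placeholder for any extra settings
    .getOrCreate()
val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._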
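
Since there are several ways to build a dataframe from an rdd, here is a rough sketch of one alternative to .toDF: createDataFrame with an explicit schema. The sample rows, the app name and the variable names are assumptions made up for illustration; they are not part of the example from the question.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("rdd to dataframe sketch") // hypothetical app name
  .master("local")
  .getOrCreate()

// Hypothetical sample rows standing in for real results
val rows = spark.sparkContext.parallelize(Seq(Row("[z]", 5L), Row("[x]", 4L)))

// Column names and types are spelled out instead of being inferred
val schema = StructType(Seq(
  StructField("items", StringType, nullable = false),
  StructField("freq", LongType, nullable = false)
))

val df = spark.createDataFrame(rows, schema)
df.printSchema()
```

This route is more verbose than .toDF, but it can be handy when the column types should be fixed explicitly rather than derived from a tuple or case class.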

Next, read the fpgrowth text file and convert it to an rdd:

import org.apache.spark.rdd.RDD
val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
// Each line is one transaction; split it into items on spaces
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

I used the code from Frequent Pattern Mining - RDD-based API, i.e. what is provided in the question:

import org.apache.spark.mllib.fpm.FPGrowth
val fpg = new FPGrowth()
    .setMinSupport(0.2)
    .setNumPartitions(10)
val model = fpg.run(transactions)

The next step is to call the .toDF function.

For the first dataframe:

model.freqItemsets.map(itemset => (itemset.items.mkString("[", ",", "]"), itemset.freq)).toDF("items", "freq").show(false)

This results in:

+---------+----+
|items    |freq|
+---------+----+
|[z]      |5   |
|[x]      |4   |
|[x,z]    |3   |
|[y]      |3   |
|[y,x]    |3   |
|[y,x,z]  |3   |
|[y,z]    |3   |
|[r]      |3   |
|[r,x]    |2   |
|[r,z]    |2   |
|[s]      |3   |
|[s,y]    |2   |
|[s,y,x]  |2   |
|[s,y,x,z]|2   |
|[s,y,z]  |2   |
|[s,x]    |3   |
|[s,x,z]  |2   |
|[s,z]    |2   |
|[t]      |3   |
|[t,y]    |3   |
+---------+----+
only showing top 20 rows
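
The conversion above joins the items into one string with mkString. If the items should stay queryable as a collection, a possible variant is to pass the array through unchanged, so that the column gets an array type. This is only a sketch: the in-memory transactions, the app name and the variable names are hypothetical and stand in for sample_fpgrowth.txt.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fpgrowth array column sketch") // hypothetical app name
  .master("local")
  .getOrCreate()
import spark.implicits._

// Hypothetical transactions, used here instead of sample_fpgrowth.txt
val transactions: RDD[Array[String]] = spark.sparkContext.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("a", "c"),
  Array("a")
))

val model = new FPGrowth().setMinSupport(0.5).run(transactions)

// Keep items as Array[String]; the resulting column is array<string>
val freqDF = model.freqItemsets
  .map(itemset => (itemset.items, itemset.freq))
  .toDF("items", "freq")

freqDF.printSchema()
freqDF.show(false)
```

With an array column, functions such as size or array_contains from org.apache.spark.sql.functions can be applied to items later, which is not possible once the items have been flattened into a string.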

For the second dataframe:

val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
    .map(rule => (rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
    .toDF("antecedent", "consequent", "confidence").show(false)

which results in:

+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y]   |[x]       |1.0       |
|[t,s,y]   |[z]       |1.0       |
|[y,x,z]   |[t]       |1.0       |
|[y]       |[x]       |1.0       |
|[y]       |[z]       |1.0       |
|[y]       |[t]       |1.0       |
|[p]       |[r]       |1.0       |
|[p]       |[z]       |1.0       |
|[q,t,z]   |[y]       |1.0       |
|[q,t,z]   |[x]       |1.0       |
|[q,y]     |[x]       |1.0       |
|[q,y]     |[z]       |1.0       |
|[q,y]     |[t]       |1.0       |
|[t,s,x]   |[y]       |1.0       |
|[t,s,x]   |[z]       |1.0       |
|[q,t,y,z] |[x]       |1.0       |
|[q,t,x,z] |[y]       |1.0       |
|[q,x]     |[y]       |1.0       |
|[q,x]     |[t]       |1.0       |
|[q,x]     |[z]       |1.0       |
+----------+----------+----------+
only showing top 20 rows
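
As a side note, if Spark 2.2 or later is an option (the question links the 2.1.0 docs, where this is not yet available), the DataFrame-based org.apache.spark.ml.fpm.FPGrowth should return both results as dataframes directly, so no RDD conversion is needed. The sketch below assumes that version; the transactions, thresholds and variable names are illustrative only.

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ml fpgrowth sketch") // hypothetical app name
  .master("local")
  .getOrCreate()
import spark.implicits._

// Hypothetical transactions: one space-separated string per basket
val dataset = spark.createDataset(Seq("1 2 5", "1 2 3 5", "1 2"))
  .map(t => t.split(" "))
  .toDF("items")

val fpgrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
val model = fpgrowth.fit(dataset)

// Both results are already dataframes
model.freqItemsets.show(false)
model.associationRules.show(false)
```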

I hope this is what you need.

+0

This is the answer I was expecting, thanks a ton.

+0

My pleasure @ArunGunalan :) Glad the answer helped you.