2017-02-10

How do you join 2 RDDs in Scala?

If I have 2 RDDs defined as:

Sample(Key1,EventDate,Value1) 
Sample2(Key1,ExecutionDate, Label1) 

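I want to join these two RDDs so that I can determine whether each Key1 exists in Sample2, and then split the full result into 2 new RDDs: one containing the records whose Key1 exists in Sample2, and the other containing all records whose Key1 does not exist in Sample2.
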
FoundKey1(Key1, EventDate,Value1) 
NotFoundKey1(Key1, ExecutionDate,Label1) 

Basically, I want something like what I would do in SQL:

SELECT Sample.Key1, Sample.EventDate, Sample.Value1
FROM Sample
WHERE NOT EXISTS (SELECT 1 FROM Sample2 WHERE Sample2.Key1 = Sample.Key1);

And for the other table:

SELECT Sample.Key1, Sample.EventDate, Sample.Value1
FROM Sample RIGHT JOIN Sample2
ON (Sample.Key1 = Sample2.Key1);

Sample RDD values:

Sample(1, 2016-01-05, 10)
Sample(1, 2016-01-05, 10)
Sample(2, 2016-01-05, 10)
Sample(2, 2016-01-05, 10)
Sample(3, 2016-01-05, 10)

Sample2(1, 2016-01-05, A)
Sample2(3, 2016-01-05, A)
Sample2(5, 2016-01-05, B)
Sample2(6, 2016-01-05, C)
Sample2(7, 2016-01-05, C)

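Before I forget: my RDDs are defined as RDD[Iterable[TestData]], where TestData is a class with the values (Key1, EventDate, Value) for Sample, and TestData2 = (Key1, ExecutionDate, Label) for Sample2.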

Here is what I have tried so far:

val grpSample = sample.groupBy(_.Key1).map(_._2)
val grpSample2 = sample2.groupBy(_.Key1).map(_._2)
val interSect = grpSample.intersection(grpSample2)

When I run this code to check my grouping, I get an error.

Please show us what you have tried so far...

It would be best to convert them to DataFrames, i.e. Spark SQL, and then call the join method directly with your condition.

@Akashi I'm a bit new to Spark, so when you say convert to a DataFrame, how can I achieve that?
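For the DataFrame route suggested in the comment above, here is a rough sketch rather than a definitive answer. It assumes Spark 2.x or later, that each RDD holds plain records (RDD[TestData], not RDD[Iterable[TestData]]), and that the case classes below match the fields described in the question. The case classes, the object name and the splitByKey helper are all hypothetical. A left_semi join keeps the Sample rows whose Key1 appears in Sample2, and a left_anti join keeps the ones whose Key1 does not, which roughly corresponds to the NOT EXISTS query in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case classes mirroring the fields described in the question.
case class TestData(Key1: String, EventDate: String, Value: String)
case class TestData2(Key1: String, ExecutionDate: String, Label: String)

object DataFrameSplitSketch {
  // Returns (Sample rows whose Key1 exists in Sample2, Sample rows whose Key1 does not).
  def splitByKey(sampleDF: DataFrame, sample2DF: DataFrame): (DataFrame, DataFrame) = {
    val sameKey = sampleDF("Key1") === sample2DF("Key1")
    // left_semi: keep left rows that have at least one match, left columns only
    val found = sampleDF.join(sample2DF, sameKey, "left_semi")
    // left_anti: keep left rows that have no match at all
    val notFound = sampleDF.join(sample2DF, sameKey, "left_anti")
    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-split-sketch")
      .getOrCreate()
    import spark.implicits._ // enables rdd.toDF() for case classes

    val sample = spark.sparkContext.parallelize(Seq(
      TestData("1", "2016-01-05", "10"),
      TestData("1", "2016-01-05", "10"),
      TestData("2", "2016-01-05", "10"),
      TestData("2", "2016-01-05", "10"),
      TestData("3", "2016-01-05", "10")))
    val sample2 = spark.sparkContext.parallelize(Seq(
      TestData2("1", "2016-01-05", "A"),
      TestData2("3", "2016-01-05", "A"),
      TestData2("5", "2016-01-05", "B"),
      TestData2("6", "2016-01-05", "C"),
      TestData2("7", "2016-01-05", "C")))

    // An RDD of case class instances converts to a DataFrame with toDF()
    val (found, notFound) = splitByKey(sample.toDF(), sample2.toDF())
    found.show()
    notFound.show()

    spark.stop()
  }
}
```

If your RDDs really are RDD[Iterable[TestData]], you would probably need to flatten them first (for example with flatMap(identity)) before calling toDF().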

Answers

// Group each RDD by Key1, giving pair RDDs of (Key1, Iterable[record])
val rdd1 = sample.groupBy(_.Key1)
val rdd2 = sample2.groupBy(_.Key1)

// Data for which the key exists in both RDDs
val result1 = rdd1.join(rdd2).map(_._2)

// Data for which the key exists in the first RDD but not in the second
val tempresult = rdd1.fullOuterJoin(rdd2)
val result2 = tempresult.filter(_._2._2.isEmpty).map(_._2._1.get)

How can I convert the results so that I can access the original (Key1, EventDate, Value)? This is what I get:
result1: org.apache.spark.rdd.RDD[Iterable[TestData]] = MapPartitionsRDD[47] at map at :58
tempresult: org.apache.spark.rdd.RDD[(String, (Option[Iterable[TestData]], Option[Iterable[TestData2]]))] = MapPartitionsRDD[50] at :57
result2: org.apache.spark.rdd.RDD[Option[Iterable[TestData]]] = MapPartitionsRDD[52] at map at :59

Try: result2 = tempresult filter(_._2._2.isEmpty) map (_._2._1.get)
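To get back to the individual (Key1, EventDate, Value) records instead of grouped Iterables, one option is to swap map for flatMap in the answer's code. The following is only a sketch under assumptions: the case classes, the object name and the splitByKey helper are hypothetical, the key is taken to be a String (as the REPL output above suggests), and the input RDDs are assumed to hold plain records.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical case classes mirroring the fields described in the question.
case class TestData(Key1: String, EventDate: String, Value: String)
case class TestData2(Key1: String, ExecutionDate: String, Label: String)

object RddSplitSketch {
  // Returns (Sample records whose Key1 exists in Sample2, Sample records whose Key1 does not).
  def splitByKey(sample: RDD[TestData], sample2: RDD[TestData2]): (RDD[TestData], RDD[TestData]) = {
    val rdd1 = sample.groupBy(_.Key1)
    val rdd2 = sample2.groupBy(_.Key1)

    // join yields (key, (Iterable[TestData], Iterable[TestData2]));
    // flatMap over the left Iterable recovers the individual records.
    val found = rdd1.join(rdd2).flatMap { case (_, (left, _)) => left }

    // Same filter as in the answer: the right side is empty when the key
    // is missing from sample2, so the left side is always defined here.
    val notFound = rdd1.fullOuterJoin(rdd2)
      .filter { case (_, (_, right)) => right.isEmpty }
      .flatMap { case (_, (left, _)) => left.get }

    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-split-sketch")
    val sc = new SparkContext(conf)

    val sample = sc.parallelize(Seq(
      TestData("1", "2016-01-05", "10"),
      TestData("2", "2016-01-05", "10"),
      TestData("3", "2016-01-05", "10")))
    val sample2 = sc.parallelize(Seq(
      TestData2("1", "2016-01-05", "A"),
      TestData2("3", "2016-01-05", "A"),
      TestData2("5", "2016-01-05", "B")))

    val (found, notFound) = splitByKey(sample, sample2)
    found.collect().foreach(println)
    notFound.collect().foreach(println)

    sc.stop()
  }
}
```

Both results are then RDD[TestData], so fields such as Key1 and EventDate can be accessed directly on each element.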

Did you find this solution helpful?