How do you join two RDDs in Scala?

Suppose I have two RDDs defined as:

Sample(Key1,EventDate,Value1) 
Sample2(Key1,ExecutionDate, Label1)

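I want to join these two RDDs so that I can determine whether each Key1 exists in Sample2, and then split the full result into two new RDDs: one containing the records whose Key1 exists in Sample2, and the other containing all records whose Key1 does not exist in Sample2.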

FoundKey1(Key1, EventDate,Value1) 
NotFoundKey1(Key1, ExecutionDate,Label1)

Basically, I want something like what I would do in SQL:

SELECT Sample.Key1, Sample.EventDate, Sample.Value1
FROM Sample
WHERE NOT EXISTS (SELECT 1 FROM Sample2 WHERE Sample2.Key1 = Sample.Key1)

And for the other table:

SELECT Sample.Key1, Sample.EventDate, Sample.Value1
FROM Sample RIGHT JOIN Sample2
ON (Sample.Key1 = Sample2.Key1);

Sample RDD values:

Sample(1, 2016-01-05, 10)
Sample(1, 2016-01-05, 10)
Sample(2, 2016-01-05, 10)
Sample(2, 2016-01-05, 10)
Sample(3, 2016-01-05, 10)

Sample(1, 2016-01-05, A)
Sample(3, 2016-01-05, A)
Sample(5, 2016-01-05, B)
Sample(6, 2016-01-05, C)
Sample(7, 2016-01-05, C)

Before I forget: my RDDs are defined as RDD[Iterable[TestData]], where TestData is a class with the values (Key1, EventDate, Value) for Sample, and TestData2 = (Key1, ExecutionDate, Label).

Here is what I have tried so far:

val grpSample = sample.groupBy(_.Key1).map(_._2)
val grpSample2 = sample2.groupBy(_.Key1).map(_._2)
val interSect = grpSample.intersection(grpSample2)

When I run this code to inspect my grouping, I get an error.

2017-02-10 E B

Let us know what you have tried so far...

The best approach is to convert them to DataFrames (i.e. Spark SQL) and then call the join method directly with your condition.

@Akashi .. I'm fairly new to Spark .. so when you say convert to a DataFrame, how can I do that?
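One way the DataFrame suggestion in the comment above could look is sketched below. It assumes Spark 2.x or later (for `SparkSession` and the `left_semi` / `left_anti` join types), assumes the RDDs hold flat `TestData` / `TestData2` records rather than `Iterable`s, and the case-class definitions, the object name `DataFrameSplitSketch` and the helper `splitByKey` are hypothetical names made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case classes mirroring the fields described in the question.
case class TestData(Key1: String, EventDate: String, Value1: String)
case class TestData2(Key1: String, ExecutionDate: String, Label1: String)

object DataFrameSplitSketch {
  // Returns (rows of sampleDF whose Key1 occurs in sample2DF,
  //          rows of sampleDF whose Key1 does not occur in sample2DF).
  def splitByKey(sampleDF: DataFrame, sample2DF: DataFrame): (DataFrame, DataFrame) = {
    // left_semi keeps left rows that have a match; only left columns are returned.
    val found = sampleDF.join(sample2DF, Seq("Key1"), "left_semi")
    // left_anti keeps left rows that have no match (the NOT EXISTS case).
    val notFound = sampleDF.join(sample2DF, Seq("Key1"), "left_anti")
    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-split-sketch")
      .getOrCreate()
    import spark.implicits._ // brings rdd.toDF() into scope

    val sample = spark.sparkContext.parallelize(Seq(
      TestData("1", "2016-01-05", "10"),
      TestData("2", "2016-01-05", "10"),
      TestData("3", "2016-01-05", "10")))
    val sample2 = spark.sparkContext.parallelize(Seq(
      TestData2("1", "2016-01-05", "A"),
      TestData2("5", "2016-01-05", "B")))

    val (found, notFound) = splitByKey(sample.toDF(), sample2.toDF())
    found.show()
    notFound.show()
    spark.stop()
  }
}
```

If an RDD is needed again afterwards, `found.as[TestData].rdd` should convert the result back, provided `spark.implicits._` is in scope.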

val rdd1 = sample.groupBy(_.Key1)
val rdd2 = sample2.groupBy(_.Key1)

// Data for keys that exist in both RDDs
val result1 = rdd1.join(rdd2).map(_._2)

// Data for keys that exist in the first RDD but not in the second
val tempresult = rdd1.fullOuterJoin(rdd2)
val result2 = tempresult.filter(_._2._2.isEmpty).map(_._2._1.get)
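As an alternative to grouping first, the same split can probably be done on the ungrouped records with `keyBy`, `join` and `subtractByKey`, which yields flat records directly. This is only a sketch: the case classes, the object name `RddSplitSketch` and the helper `splitByKey` are hypothetical, and it assumes `sample` and `sample2` are flat `RDD[TestData]` / `RDD[TestData2]`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical case classes mirroring the fields described in the question.
case class TestData(Key1: String, EventDate: String, Value1: String)
case class TestData2(Key1: String, ExecutionDate: String, Label1: String)

object RddSplitSketch {
  // Returns (records of sample whose Key1 is in sample2,
  //          records of sample whose Key1 is not in sample2).
  def splitByKey(sample: RDD[TestData],
                 sample2: RDD[TestData2]): (RDD[TestData], RDD[TestData]) = {
    val byKey = sample.keyBy(_.Key1)
    // Deduplicate the keys so the join does not multiply rows of sample.
    val keys2 = sample2.map(r => (r.Key1, ())).distinct()

    val found = byKey.join(keys2).map { case (_, (row, _)) => row }
    val notFound = byKey.subtractByKey(keys2).values
    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-split-sketch")
    val sc = new SparkContext(conf)

    val sample = sc.parallelize(Seq(
      TestData("1", "2016-01-05", "10"),
      TestData("2", "2016-01-05", "10"),
      TestData("3", "2016-01-05", "10")))
    val sample2 = sc.parallelize(Seq(
      TestData2("1", "2016-01-05", "A"),
      TestData2("5", "2016-01-05", "B")))

    val (found, notFound) = splitByKey(sample, sample2)
    found.collect().foreach(println)
    notFound.collect().foreach(println)
    sc.stop()
  }
}
```

Compared with `fullOuterJoin` followed by a filter, `subtractByKey` never materialises the keys that exist only in the second RDD.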


2017-02-10 10:43:26

How do I convert the results so that I can access the original (Key1, EventDate, Value)? This is what I get: result1: org.apache.spark.rdd.RDD[Iterable[TestData]] = MapPartitionsRDD[47] at map at :58; tempresult: org.apache.spark.rdd.RDD[(String, (Option[Iterable[TestData]], Option[Iterable[TestData2]]))] = MapPartitionsRDD[50] at :57; result2: org.apache.spark.rdd.RDD[Option[Iterable[TestData]]] = MapPartitionsRDD[52] at map at :59

Try: result2 = tempresult filter(_._2._2.isEmpty) map(_._2._1.get)
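To get from the grouped results back to individual (Key1, EventDate, Value) records, the `Option` and `Iterable` layers have to be unwrapped. A minimal sketch of one way to do that is below; the case classes, the object name `FlattenSketch` and the helper `flattenSplit` are hypothetical, and the sketch assumes flat `RDD[TestData]` / `RDD[TestData2]` inputs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical case classes mirroring the fields described in the question.
case class TestData(Key1: String, EventDate: String, Value1: String)
case class TestData2(Key1: String, ExecutionDate: String, Label1: String)

object FlattenSketch {
  // Returns (found, notFound) as flat RDDs of the original TestData records.
  def flattenSplit(sample: RDD[TestData],
                   sample2: RDD[TestData2]): (RDD[TestData], RDD[TestData]) = {
    val rdd1 = sample.groupBy(_.Key1)
    val rdd2 = sample2.groupBy(_.Key1)
    val tempresult = rdd1.fullOuterJoin(rdd2)

    // Pattern matching drops the Option wrappers; flatMap then turns each
    // Iterable[TestData] group back into individual records.
    val found = tempresult
      .collect { case (_, (Some(rows), Some(_))) => rows }
      .flatMap(rows => rows)
    val notFound = tempresult
      .collect { case (_, (Some(rows), None)) => rows }
      .flatMap(rows => rows)
    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("flatten-sketch"))
    val sample = sc.parallelize(Seq(
      TestData("1", "2016-01-05", "10"),
      TestData("2", "2016-01-05", "10")))
    val sample2 = sc.parallelize(Seq(TestData2("1", "2016-01-05", "A")))

    val (found, notFound) = flattenSplit(sample, sample2)
    found.collect().foreach(r => println((r.Key1, r.EventDate, r.Value1)))
    notFound.collect().foreach(r => println((r.Key1, r.EventDate, r.Value1)))
    sc.stop()
  }
}
```

Here `collect { ... }` with a partial function is the RDD transformation (filter and map in one step), not the action that brings data to the driver.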

Did you find this solution helpful?
