2017-09-24 67 views

Answers

1

Just call the Spark Scala library API directly:

def distinct(): RDD[T] 

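Keep in mind that this is a generic method with a type parameter.

If you call it on an RDD of type RDD[(Int, Int)], it returns the distinct (Int, Int) pairs in your RDD, with the element type unchanged.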


If you want to see how the method works internally, see its definition below:

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
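
The pipeline above can be mimicked on an ordinary Scala collection to see why it removes duplicates. This is only a local sketch, not Spark code: `DistinctSketch` and `distinctLike` are hypothetical names, and `groupBy` stands in for the shuffle that `reduceByKey` performs across partitions.

```scala
object DistinctSketch {
  // Local analogue of map(x => (x, null)).reduceByKey((x, y) => x).map(_._1).
  // Assumption: groupBy plays the role of Spark's shuffle by key.
  def distinctLike[T](xs: Seq[T]): Seq[T] =
    xs.map(x => (x, null))              // pair every element with a dummy value
      .groupBy(_._1)                    // bring equal elements (keys) together
      .values
      .map(_.reduce((a, b) => a))       // keep a single pair per key
      .map(_._1)                        // drop the dummy value again
      .toSeq

  def main(args: Array[String]): Unit = {
    val pairs = Seq((1, 2), (1, 2), (3, 4))
    // Prints the unique pairs; the order of the result is not guaranteed.
    println(distinctLike(pairs))
  }
}
```

Because each element becomes a key, all equal elements end up in the same group, and the reduce step keeps just one of them. As with the RDD version, the result carries no ordering guarantee.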
0

You can use distinct, for example:

val data = sc.parallelize(
  Seq(
    ("Foo", "41", "US", "3"),
    ("Foo", "39", "UK", "1"),
    ("Bar", "57", "CA", "2"),
    ("Bar", "72", "CA", "2"),
    ("Baz", "22", "US", "6"),
    ("Baz", "36", "US", "6"),
    ("Baz", "36", "US", "6")
  )
)

Remove the duplicates:

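// Whole tuples are compared, so only the repeated ("Baz", "36", "US", "6") row is dropped
val distinctData = data.distinct()
// Returns the 6 unique rows as an Array; their order is not guaranteed
distinctData.collect()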