2016-03-05 58 views

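With a partitioning function, partitionBy decides which partition each element goes to, but where does the element end up within that partition? Given: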

balanceLoad = lambda x: bisect.bisect_left(boundary_array, -keyfunc(x)) 

where boundary_array is [-64, -10, 35],

the following tells me which partition each element is assigned to:

rdd.partitionBy(numPartitions, balanceLoad) 
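
To make the mapping concrete, here is a minimal sketch in plain Python (no Spark) of the partition index balanceLoad returns for a few sample keys. It assumes keyfunc is the identity function, as in the sortByKey call used in this question; sample_keys and assignments are names introduced only for illustration.

```python
import bisect

# Boundaries quoted in the question.
boundary_array = [-64, -10, 35]

# Assumption: keyfunc is the identity, as in the sortByKey call in the question.
keyfunc = lambda key: key

# Same partitioning function as in the question.
balanceLoad = lambda x: bisect.bisect_left(boundary_array, -keyfunc(x))

# Hypothetical sample keys chosen to sit on either side of each boundary.
sample_keys = [98, 64, 62, 10, 8, -35, -37, -99]
assignments = {k: balanceLoad(k) for k in sample_keys}
print(assignments)
```

Under that assumption, keys 64 to 98 fall in partition 0, 10 to 62 in partition 1, -35 to 8 in partition 2, and -99 to -37 in partition 3. The function only picks the partition; it says nothing about the position of an element inside it.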

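However, is there a way to determine or control where within each partition the elements are placed? For instance, {1,2,3} versus {3,2,1}.

For example, when I run this: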

rdd = CleanRDD(sc.parallelize(range(100), 4).map(lambda x: (x *((-1) ** x) , x))) 

sortByKey(rdd, keyfunc=lambda key: key, ascending=False).collect() 

the elements in each partition come out in reverse order:

[(64,64), (66,66), (68,68), (70,70), (72,72), (74,74), (76,76), (78,78), (80,80), (82,82), (84,84), (86,86), (88,88), (90,90), (92,92), (94,94), (96,96), (98,98),
 (10,10), (12,12), (14,14), (16,16), (18,18), (20,20), (22,22), (24,24), (26,26), (28,28), (30,30), (32,32), (34,34), (36,36), (38,38), (40,40), (42,42), (44,44), (46,46), (48,48), (50,50), (52,52), (54,54), (56,56), (58,58), (60,60), (62,62),
 (-35,35), (-33,33), (-31,31), (-29,29), (-27,27), (-25,25), (-23,23), (-21,21), (-19,19), (-17,17), (-15,15), (-13,13), (-11,11), (-9,9), (-7,7), (-5,5), (-3,3), (-1,1), (0,0), (2,2), (4,4), (6,6), (8,8),
 (-99,99), (-97,97), (-95,95), (-93,93), (-91,91), (-89,89), (-87,87), (-85,85), (-83,83), (-81,81), (-79,79), (-77,77), (-75,75), (-73,73), (-71,71), (-69,69), (-67,67), (-65,65), (-63,63), (-61,61), (-59,59), (-57,57), (-55,55), (-53,53), (-51,51), (-49,49), (-47,47), (-45,45), (-43,43), (-41,41), (-39,39), (-37,37)]

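Note that the elements within each group are in reverse order: the groups themselves appear in descending order, but the keys inside each group are ascending. How can I fix this?

Answer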


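Determine it: no, because the order in which a shuffle delivers records is non-deterministic.

You can control the order, but not as part of the partitioning process, or at least not in PySpark. Instead, you can take an approach similar to sortByKey and enforce the order on each partition afterwards: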

def applyOrdering(iter): 
    """Takes an itertools.chain object 
    and returns iterable with specific ordering""" 
    ... 

rdd.partitionBy(numPartitions, balanceLoad).mapPartitions(applyOrdering) 

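Note that iter may be too large to fit in memory, so you should either increase the granularity or use a sorting mechanism that does not need to read all the data at once.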
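
As an illustration only, here is one way the body of such a function could look when a descending key order is wanted. This is a sketch, not the answer's actual implementation: apply_descending_order is a hypothetical name, and because it uses sorted() it loads the whole partition into memory, so it only suits partitions that are small enough. The partitions are simulated with plain lists so the snippet runs without Spark.

```python
from operator import itemgetter

def apply_descending_order(records):
    """Hypothetical stand-in for applyOrdering.

    Takes an iterator of (key, value) pairs for one partition and
    returns an iterator over the same pairs sorted by key, descending.
    """
    # sorted() materialises the whole partition in memory.
    return iter(sorted(records, key=itemgetter(0), reverse=True))

# Plain-Python stand-in for two partitions that arrive in ascending key order.
partitions = [
    [(64, 64), (66, 66), (68, 68)],
    [(-3, 3), (-1, 1), (0, 0), (2, 2)],
]
ordered = [list(apply_descending_order(iter(p))) for p in partitions]
print(ordered)
```

With Spark, the same function would be passed to mapPartitions in place of applyOrdering, for example `rdd.partitionBy(numPartitions, balanceLoad).mapPartitions(apply_descending_order)`.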