How to select an exact number of random rows from a DataFrame

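How can I efficiently select an exact number of random rows from a DataFrame? The data contains an index column that can be used. If I have to use the maximum size, which is more efficient on the index column: count() or max()?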

2016-11-06 Boris

Can't you just use 'df.sample()'? – mtoto

@mtoto sample() returns an approximate number of rows, but in some cases an algorithm requires an exact number. – Boris
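
The difference described in the comment above can be illustrated with a small plain-Python sketch (no Spark involved). It assumes, as a simplification, that fraction-based sampling keeps each row independently with probability `fraction`; the helper names `bernoulli_sample` and `exact_sample` are made up for this illustration and are not part of any library.

```python
import random

def bernoulli_sample(rows, fraction, rng):
    # Keep each row independently with probability `fraction`,
    # so the size of the result varies from run to run.
    return [r for r in rows if rng.random() < fraction]

def exact_sample(rows, n, rng):
    # Draw exactly n distinct rows without replacement.
    return rng.sample(rows, n)

rows = list(range(1, 1001))
rng = random.Random(0)

# Collect the distinct result sizes seen over 50 repetitions of each method.
approx_sizes = {len(bernoulli_sample(rows, 0.1, rng)) for _ in range(50)}
exact_sizes = {len(exact_sample(rows, 100, rng)) for _ in range(50)}

print(sorted(approx_sizes))  # sizes scatter around 100
print(exact_sizes)           # only one size: 100
```

The fraction-based version only hits the target size on average, which is why an exact-count requirement needs a different approach.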

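One possible approach is to compute the number of rows with .count(), then use sample() from Python's random library to draw a random sequence of the desired length from that range. Finally, use the resulting list of numbers, vals, to filter on your index column. This assumes the index column holds the consecutive integers from 1 up to the row count.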

import random

def sampler(df, col, records):
    # Count the rows; assumes `col` holds the consecutive integers 1..count
    colmax = df.count()

    # Draw `records` distinct index values from 1..colmax (inclusive)
    vals = random.sample(range(1, colmax + 1), records)

    # Use 'vals' to filter the DataFrame with 'isin'
    return df.filter(df[col].isin(vals))

Example:

df = sc.parallelize([(1,1),(2,1), 
        (3,1),(4,0), 
        (5,0),(6,1), 
        (7,1),(8,0), 
        (9,0),(10,1)]).toDF(["a","b"]) 

sampler(df,"a",3).show() 
+---+---+
|  a|  b|
+---+---+
|  3|  1|
|  4|  0|
|  6|  1|
+---+---+

2016-11-06 22:49:12 mtoto

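Thanks for the suggestion. This is the approach I had settled on as well. The reason I would rather not use this solution is that it relies on the count() method, which is very expensive. – Boris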

You could also cache your 'df' and compute 'count()' outside the function, or use 'agg(max)'. – mtoto
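
A minimal sketch of what this comment suggests, under stated assumptions: the upper bound is computed once by the caller and passed in, so nothing inside the function triggers a count(). The names `pick_ids` and `sampler_with_bound` are hypothetical, and the index column is again assumed to hold the consecutive integers 1..N. Whether agg(max) is actually cheaper than count() on a given dataset is not measured here.

```python
import random

def pick_ids(upper, records, rng=random):
    # Draw `records` distinct values from 1..upper (inclusive).
    return rng.sample(range(1, upper + 1), records)

def sampler_with_bound(df, col, records, upper):
    # Same 'isin' filter as in sampler(), but the bound is supplied
    # by the caller, so no count() runs inside the function.
    return df.filter(df[col].isin(pick_ids(upper, records)))

# Possible usage with a Spark DataFrame `df` whose column "a" holds 1..N:
#   df.cache()
#   upper = df.agg({"a": "max"}).first()[0]   # or: upper = df.count()
#   sampler_with_bound(df, "a", 3, upper).show()
```

Computing the bound once also lets several samples be drawn from the same cached DataFrame without repeating the aggregation.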

Thanks, I used your solution in Java. – Boris
