Pandas: replicate a single row to fill a DataFrame

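I'm stuck in a dead end: I'm using code that is decidedly un-pandas for what should be a very simple pandas task. I'm sure there is a better way.

I have a DataFrame from which I extract one row to create a new DataFrame, like this: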

>>> sampledata 
float_col int_col str_col r v new_coltest  eddd 
0  0.1  1  a 5 1.0   0.1 -0.539783 
1  0.2  2  b 5 NaN   0.2 -1.394550 
2  0.2  6 None 5 NaN   0.2 0.290157 
3  10.1  8  c 5 NaN   10.1 -1.799373 
4  NaN  -1  a 5 NaN   NaN 0.694682 
>>> newsampledata = sampledata[(sampledata.new_coltest == 0.1) & (sampledata.float_col == 0.1)] 
>>> newsampledata 
float_col int_col str_col r v new_coltest  eddd 
0  0.1  1  a 5 1.0   0.1 -0.539783

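What I want to do is replicate the single row of "newsampledata" n times, where n is a known integer. Ideally the final DataFrame with n rows would overwrite the single-row "newsampledata", but that is not essential.

I am currently using a for loop that runs pd.concat n-1 times to fill the DataFrame, but because of how concat works this is not fast. I also tried the same strategy with append, which was slightly slower than concat.

I have seen a few other questions about similar tasks, but most do not cover this exact issue. Also, I have been steering away from map/apply for performance reasons, but if you have seen that approach perform well, let me know and I will try it too.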

TIA

Source

2016-12-06 rajan

You can use the DataFrame constructor:

import numpy as np
import pandas as pd

N = 10
df = pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)
print(df)
    float_col int_col str_col r v new_coltest  eddd 
0  0.1  1  a 5 1.0   0.1 -0.539783 
1  0.1  1  a 5 1.0   0.1 -0.539783 
2  0.1  1  a 5 1.0   0.1 -0.539783 
3  0.1  1  a 5 1.0   0.1 -0.539783 
4  0.1  1  a 5 1.0   0.1 -0.539783 
5  0.1  1  a 5 1.0   0.1 -0.539783 
6  0.1  1  a 5 1.0   0.1 -0.539783 
7  0.1  1  a 5 1.0   0.1 -0.539783 
8  0.1  1  a 5 1.0   0.1 -0.539783 
9  0.1  1  a 5 1.0   0.1 -0.539783 

print (df.dtypes) 
float_col  float64 
int_col   int64 
str_col   object 
r    int64 
v    float64 
new_coltest float64 
eddd   float64 
dtype: object

Timings:

For a small DataFrame the sample and reindex methods are faster; for a large DataFrame the constructor method is faster.

N = 1000 
In [88]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)) 
1000 loops, best of 3: 745 µs per loop 

In [89]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True)) 
The slowest run took 4.88 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 470 µs per loop 

In [90]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True)) 
1000 loops, best of 3: 476 µs per loop

N = 10000 
In [92]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)) 
1000 loops, best of 3: 946 µs per loop 

In [93]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True)) 
1000 loops, best of 3: 775 µs per loop 

In [94]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True)) 
1000 loops, best of 3: 827 µs per loop

N = 100000 
In [97]: %timeit (pd.DataFrame(newsampledata.values.tolist(), index=np.arange(N), columns=sampledata.columns)) 
The slowest run took 12.98 times longer than the fastest. This could mean that an intermediate result is being cached. 
100 loops, best of 3: 6.93 ms per loop 

In [98]: %timeit (newsampledata.sample(N, replace=True).reset_index(drop=True)) 
100 loops, best of 3: 7.07 ms per loop 

In [99]: %timeit (newsampledata.reindex(newsampledata.index.repeat(N)).reset_index(drop=True)) 
100 loops, best of 3: 7.87 ms per loop

Source

2016-12-06 07:29:37 jezrael

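One of the good solutions. It seems to work without problems, and I agree that it is faster. I didn't know you could set the index that way; I'll have to remember this one! – rajan

In a previous version you had a numpy version, whose drawback was that the dtypes were converted to object. How does that solution compare in performance once the original dtypes are restored? Maybe numpy is still faster ;) – Quickbeam2k1

@Quickbeam2k1 - I'll try. – jezrael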
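The numpy version mentioned in the comments is not shown in this thread, so the following is only a guess at its general shape: a minimal sketch, assuming np.repeat on the row's values followed by astype to restore the dtypes. The frame "row" and its columns are hypothetical stand-ins for newsampledata.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame standing in for newsampledata.
row = pd.DataFrame({"float_col": [0.1], "int_col": [1], "str_col": ["a"]})
n = 4

# .values on a mixed-dtype frame is one object array, so a frame built
# from the repeated array no longer has the original numeric dtypes.
tiled = np.repeat(row.values, n, axis=0)
as_object = pd.DataFrame(tiled, columns=row.columns)

# Cast each column back to the dtype it had in the original row.
restored = as_object.astype(row.dtypes.to_dict())

print(as_object.dtypes)
print(restored.dtypes)
```

The extra astype step is the cost the comment asks about; whether the combination is still faster would have to be measured.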

I think you can just sample it with replacement

newsampledata.sample(n, replace=True).reset_index(drop=True)

or reindex

newsampledata.reindex(newsampledata.index.repeat(n)).reset_index(drop=True)
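To make the reindex variant easier to try, here is a self-contained sketch. The frame "row" is a hypothetical stand-in for newsampledata, with made-up values; only the reindex/repeat/reset_index chain comes from the answer.

```python
import pandas as pd

# Hypothetical one-row frame standing in for newsampledata.
row = pd.DataFrame({"float_col": [0.1], "int_col": [1], "str_col": ["a"]})
n = 5

# index.repeat(n) yields the single label n times; reindex then selects
# that row once per label, and reset_index renumbers the rows 0..n-1.
repeated = row.reindex(row.index.repeat(n)).reset_index(drop=True)

print(repeated)
print(repeated.dtypes)
```

Because every requested label exists in the original index, no missing values are introduced and the column dtypes carry over unchanged.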

Source

2016-12-06 07:27:48

I think you can use concat without writing an explicit for loop.

df = pd.DataFrame({'a':[1], 'b':[.1]}) 
repetitions = 4 
res = pd.concat([df]*repetitions) 
print(res)

Output:

   a    b
0  1  0.1
0  1  0.1
0  1  0.1
0  1  0.1

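On my sample frame this is indeed about 5 times faster than using a loop. However, I expect that other solutions that do not use concat are significantly faster.

To show how slow concat is, here are some benchmarks comparing it with jezrael's solution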
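Note that concatenating the same frame keeps its index label on every copy. A minimal sketch of the same call with ignore_index=True, which is a standard pd.concat parameter and not part of the answer above, gives a fresh 0..n-1 index instead:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [0.1]})
repetitions = 4

# [df] * repetitions is a list holding the same frame four times;
# ignore_index=True discards the repeated label 0 and numbers rows 0..3.
res = pd.concat([df] * repetitions, ignore_index=True)
print(res)
```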
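The benchmark figures themselves are not reproduced in this thread. As a rough sketch of how such a comparison could be run outside IPython, assuming the standard timeit module and a hypothetical one-row frame, something like the following would do. The function names are made up for illustration, and no timings are claimed here.

```python
import timeit

import pandas as pd

# Hypothetical one-row frame standing in for newsampledata.
row = pd.DataFrame({"float_col": [0.1], "int_col": [1], "str_col": ["a"]})
n = 1000


def with_concat():
    # Concatenate n references to the same one-row frame.
    return pd.concat([row] * n, ignore_index=True)


def with_reindex():
    # Repeat the single index label n times and select by it.
    return row.reindex(row.index.repeat(n)).reset_index(drop=True)


# Best of 3 repeats, 3 calls each, reported as seconds per call.
timings = {
    func.__name__: min(timeit.repeat(func, number=3, repeat=3)) / 3
    for func in (with_concat, with_reindex)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```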

Source

2016-12-06 07:47:16 Quickbeam2k1

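At the end of the day, concat is very slow: a one-row DataFrame took 1.5 s with n = 10,000. –

You are right. However, this solution is at least faster than using a for loop directly. – Quickbeam2k1

I ran some benchmarks against jezrael's solution to show how slow concat is. – Quickbeam2k1

One of a bajillion ways to do this: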

pd.concat([df.query('new_coltest == 0.1 & float_col == 0.1')] * 4)
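For readers who want to run the query variant, here is a small sketch under stated assumptions: the two-row frame is hypothetical and only reuses column names from the question, and ignore_index=True is added to get a clean index.

```python
import pandas as pd

# Hypothetical two-row frame reusing column names from the question.
df = pd.DataFrame(
    {"float_col": [0.1, 0.2], "new_coltest": [0.1, 0.2], "int_col": [1, 2]}
)

# query evaluates the string against the column names; `&` combines the
# two conditions just like the boolean-mask version in the question.
match = df.query("new_coltest == 0.1 & float_col == 0.1")

# Repeat the single matching row four times.
res = pd.concat([match] * 4, ignore_index=True)
print(res)
```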

Source

2016-12-06 07:55:59 piRSquared
