Pandas gives the wrong mean

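I am reading several data files into data frames and computing their means. After concatenating the data frames, I compute the mean again, but pandas gives me the wrong answer.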

temp = pd.read_csv(appDelayFile, delimiter='\t') 
temp = temp.groupby(['Type', 'Node']).mean() 
temp = temp.ix['FullDelay'] 

d = pd.concat([d, temp]) 
print d # separate parsed data frames 
d = d.groupby(d.index).mean() 
print d # after calculating the mean

The first print gives me 0.574193, 0.441335, and 2.71299, whose mean is 1.2428393333. But the second print gives me 1.610377.

Is something wrong with the code? Or is this a bug?

**Edit**

Sample data file 1:

Time  Node AppId SeqNo Type  DelayS  RetxCount HopCount 
0.054701 25 1 0 LastDelay 0.054701 1 8 
0.054701 25 1 0 FullDelay 0.054701 1 8 
0.00708243 26 1 0 LastDelay 0.00708243 1 2 
0.00708243 26 1 0 FullDelay 0.00708243 1 2 
0.036943 25 1 0 LastDelay 0.036943 1 6 
0.036943 25 1 0 FullDelay 0.036943 1 6 
0.0582151 26 1 0 LastDelay 0.0582151 1 12 
0.0582151 26 1 0 FullDelay 0.0582151 1 12

Sample data file 2:

Time  Node AppId SeqNo Type  DelayS  RetxCount HopCount 
0.0252673 25 1 0 LastDelay 0.0252673 1 6 
0.0252673 25 1 0 FullDelay 0.0252673 1 6 
0.00655327 26 1 0 LastDelay 0.00655327 1 2 
0.00655327 26 1 0 FullDelay 0.00655327 1 2 
0.023523 25 1 0 LastDelay 0.023523 1 8 
0.023523 25 1 0 FullDelay 0.023523 1 8 
0.0380394 26 1 0 LastDelay 0.0380394 1 4 
0.0380394 26 1 0 FullDelay 0.0380394 1 4

Sample data file 3:

Time  Node AppId SeqNo Type  DelayS  RetxCount HopCount 
0.0276086 25 1 0 LastDelay 0.0276086 1 8 
0.0276086 25 1 0 FullDelay 0.0276086 1 8 
0.0197642 26 1 0 LastDelay 0.0197642 1 4 
0.0197642 26 1 0 FullDelay 0.0197642 1 4 
0.00708267 25 1 0 LastDelay 0.00708267 1 2 
0.00708267 25 1 0 FullDelay 0.00708267 1 2 
0.00708268 26 1 0 LastDelay 0.00708268 1 2 
0.00708268 26 1 0 FullDelay 0.00708268 1 2

Parsed data files:

   Time AppId  SeqNo DelayS  DelayUS RetxCount HopCount 
25 0.045822  1  0 0.045822 45822.000   1   7 
26 0.032649  1  0 0.032649 32648.765   1   7 
      Time AppId SeqNo DelayS DelayUS RetxCount HopCount 
Node                 
25 0.024395  1  0 0.024395 24395.150   1   7 
26 0.022296  1  0 0.022296 22296.335   1   3 
      Time AppId SeqNo DelayS DelayUS RetxCount HopCount 
Node                 
25 0.017346  1  0 0.017346 17345.635   1   5 
26 0.013423  1  0 0.013423 13423.440   1   3

The second print shows the mean of the data frames (which is wrong):

  Time AppId  SeqNo DelayS  DelayUS RetxCount HopCount 
25 0.026227  1  0 0.026227 26227.105   1   6 
26 0.020448  1  0 0.020448 20447.995   1   4

This is the printed output of temp = temp.groupby(['Type', 'Node']).count():

  Time AppId  SeqNo DelayS  DelayUS RetxCount HopCount 
Node               
25  2  2  2  2  2   2   2 
26  2  2  2  2  2   2   2

2017-05-31 John

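Can you post the data you are using? And what is 'd'? – darthbith

For debugging questions, you need to provide a [mcve]. Please include a dataset that shows the problem (as short as possible and, ideally, ready to copy and paste). :) – MSeifert

Can you show the _actual_ output of the 'print' statements? The second print statement should display a DataFrame (or possibly a Series), not a single value. –

If the group sizes are not all equal, you should not expect the overall mean to be the mean of the group means.

Look at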

temp.groupby(['Type', 'Node']).count()

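You will see that the groups have different sizes.

If you want the means to match, you can compute a weighted average, as in the following: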

from __future__ import print_function, division 
import pandas as pd 
import numpy as np 
np.random.seed(10) 
df = pd.DataFrame(np.random.randint(0, 3, size=(20,2)), 
        columns=['node', 'type']) 
df['delay'] = np.random.uniform(size=20) 
grouper = df.groupby(['type', 'node']).delay 
print(df.delay.mean()) 
# 0.512131169932 
print(grouper.mean().mean()) 
# 0.55613710694900131 
print ((grouper.mean() * grouper.count()).sum()/df.index.size) 
# 0.51213116993222196
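
In symbols, the weighted average in the last line above can be sketched as follows. The notation is introduced here for illustration and is not part of the original answer: $n_g$ is the number of rows in group $g$, $\bar{x}_g$ is that group's mean, and $G$ is the number of groups.

```latex
\bar{x} \;=\; \frac{\sum_{g} n_g \,\bar{x}_g}{\sum_{g} n_g}
\qquad\text{versus}\qquad
\frac{1}{G}\sum_{g} \bar{x}_g
```

The two expressions are guaranteed to agree when every $n_g$ is the same; with unequal group sizes they generally differ, which is what the second and third printed values show.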

2017-05-31 19:22:45

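Thanks. The only time the sizes differ is when the raw data files are converted into pandas data frames. – John

@john Are you saying that you can run temp.groupby(['Type', 'Node']).count().unique() and see only one value? –

Hi @DavidNehme, I added the output of .count as you asked. What I am trying to say is that the mean for each node (25 to 40) is correct. The problem arises when I concatenate 10 data frames and compute the mean. – John

Got it! The problem was that I computed the mean every time I concatenated a data frame, instead of concatenating all the data frames and computing the mean once at the end. That is, I moved d = d.groupby(d.index).mean() into the first for loop, rather than leaving it in the nested loop.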
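
A minimal sketch of that difference, using the three values quoted in the question. The node label 25 and the column name DelayS are assumptions made for illustration, since the question does not say which node or column those three numbers belong to, and the loop here is a simplified stand-in for the asker's actual loops, which were never posted.

```python
import pandas as pd

# The three per-file means quoted in the question, placed under one
# hypothetical node label (25) and column name (DelayS).
per_file = [
    pd.DataFrame({"DelayS": [value]}, index=[25])
    for value in (0.574193, 0.441335, 2.71299)
]

# Pattern that gives the unexpected result: average after every concat.
# Each pass collapses the earlier files into a single row, so they are
# weighted less and less as more files arrive.
running = per_file[0]
for frame in per_file[1:]:
    running = pd.concat([running, frame])
    running = running.groupby(running.index).mean()

# Pattern that gives the expected result: concatenate everything first,
# then average once.
combined = pd.concat(per_file)
final = combined.groupby(combined.index).mean()

running_mean = running.loc[25, "DelayS"]
final_mean = final.loc[25, "DelayS"]
print(running_mean)  # approximately 1.610377
print(final_mean)    # approximately 1.2428393333
```

The first value reproduces the 1.610377 reported in the question, and the second is the 1.2428393333 the asker expected.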

2017-06-01 14:15:55 John

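So there is a for loop?! But in the code shown in the question, there is no loop to be seen! If you had shown the actual code you were using in the original question, you would probably have had an answer within minutes. –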
