Here is a small function that computes descriptive statistics from a frequency distribution:
    # from __future__ import division (for Python 2)
    import numpy as np
    import pandas as pd

    def descriptives_from_agg(values, freqs):
        values = np.array(values)
        freqs = np.array(freqs)
        # Sort values ascending and keep the frequencies aligned with them
        arg_sorted = np.argsort(values)
        values = values[arg_sorted]
        freqs = freqs[arg_sorted]
        count = freqs.sum()
        fx = values * freqs
        mean = fx.sum() / count
        # Population variance: E[x^2] - (E[x])^2, weighted by frequency
        variance = ((freqs * values**2).sum() / count) - mean**2
        variance = count / (count - 1) * variance  # dof correction for sample variance
        std = np.sqrt(variance)
        minimum = np.min(values)
        maximum = np.max(values)
        # Quartiles: first value whose cumulative frequency reaches the target rank
        cumcount = np.cumsum(freqs)
        Q1 = values[np.searchsorted(cumcount, 0.25 * count)]
        Q2 = values[np.searchsorted(cumcount, 0.50 * count)]
        Q3 = values[np.searchsorted(cumcount, 0.75 * count)]
        idx = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
        result = pd.Series([count, mean, std, minimum, Q1, Q2, Q3, maximum], index=idx)
        return result
A demo:
    np.random.seed(0)
    val = np.random.normal(100, 5, 1000).astype(int)
    pd.Series(val).describe()
Out:
    count    1000.000000
    mean       99.274000
    std         4.945845
    min        84.000000
    25%        96.000000
    50%        99.000000
    75%       103.000000
    max       113.000000
    dtype: float64
    # pd.value_counts(val) is deprecated in recent pandas; the Series method is equivalent
    vc = pd.Series(val).value_counts()
    descriptives_from_agg(vc.index, vc.values)
Out:
    count    1000.000000
    mean       99.274000
    std         4.945845
    min        84.000000
    25%        96.000000
    50%        99.000000
    75%       103.000000
    max       113.000000
    dtype: float64
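On small inputs, one way to sanity-check the aggregated formulas is to expand the frequency table back into raw observations with `np.repeat` and compare against `describe()`. The sketch below does that for the mean and sample standard deviation only; the four-row table is made up for illustration and is not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frequency table, invented for this check.
values = np.array([1, 2, 3, 4])
freqs = np.array([2, 3, 4, 1])

# Rebuild the raw observations: each value is repeated `freq` times.
expanded = pd.Series(np.repeat(values, freqs))
reference = expanded.describe()

# The same statistics computed directly from the aggregated form.
count = freqs.sum()
mean = (values * freqs).sum() / count
variance = (freqs * values**2).sum() / count - mean**2
std = np.sqrt(count / (count - 1) * variance)  # sample (ddof=1) correction

print(reference[['count', 'mean', 'std']])
print(count, mean, std)
```

Expanding is only practical when the total count is small, which is the reason for working from the aggregated form in the first place. The quartiles are left out of this check because the `searchsorted` lookup picks an observed value, while `describe()` interpolates, so the two can differ on some inputs.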
Note that this does not handle NaN and has not been properly tested.
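If missing values are a concern, one possible approach (an assumption on my part, not something the function above does) is to drop any value/frequency pair in which either entry is NaN before computing the statistics. This mirrors how `describe()` excludes NaN. The helper name `drop_missing_pairs` is hypothetical.

```python
import numpy as np

def drop_missing_pairs(values, freqs):
    """Hypothetical helper: keep only pairs where both value and frequency are present."""
    values = np.asarray(values, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    keep = ~(np.isnan(values) | np.isnan(freqs))
    return values[keep], freqs[keep]

# Made-up input: the second pair has a missing value, the third a missing frequency.
clean_values, clean_freqs = drop_missing_pairs([1.0, np.nan, 3.0], [2, 5, np.nan])
print(clean_values, clean_freqs)
```

The cleaned arrays could then be passed to `descriptives_from_agg`. Whether a pair with a missing frequency should be dropped or treated as a count of zero depends on the data, so treat this as a starting point.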
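Possible duplicate of http://stackoverflow.com/questions/17689099/using-describe-with-weighted-data –
I think this is *the same* as the question I linked: you want descriptive statistics for the "score" column, weighted by the "count" column. Alas, I don't think that question has a satisfactory answer. –
I agree that they are asking for something very similar, but I don't know how the SAS proc works, so I'll post my answer here, since it might not meet those requirements. – ayhan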