How to modify a pandas DataFrame and insert a new column

I have some data, described below. How can I modify a pandas DataFrame to insert a new column?

The output of df.info() is below,

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 6662 entries, 0 to 6661 
Data columns (total 2 columns): 
value  6662 non-null float64 
country 6478 non-null object 
dtypes: float64(1), object(1) 
memory usage: 156.1+ KB 
None 


The list of the columns,
[u'value' 'country'] 


The df is below,

     value country 
0  550.00  USA 
1  118.65 CHINA 
2  120.82 CHINA 
3  86.82 CHINA 
4  112.14 CHINA 
5  113.59 CHINA 
6  114.31 CHINA 
7  111.42 CHINA 
8  117.21 CHINA 
9  111.42 CHINA 

-------------------- 
-------------------- 
6655 500.00  USA 
6656 500.00  USA 
6657 390.00  USA 
6658 450.00  USA 
6659 420.00  USA 
6660 420.00  USA 
6661 450.00  USA

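I need to add another column, outlier, and set it to 1 if the row is an outlier for its respective country, and 0 otherwise. I want to stress that the outliers need to be computed per country, not across all countries together.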

I found some formulas that might help with computing the outliers, for example,

# keep only the rows that are within +3 to -3 standard deviations of the mean
def exclude_the_outliers(df): 
    df = df[np.abs(df.col - df.col.mean())<=(3*df.col.std())] 
    return df 


def exclude_the_outliers_extra(df): 

    LOWER_LIMIT = .35 
    HIGHER_LIMIT = .70 

    filt_df = df.loc[:, df.columns == 'value'] 

    # Then, computing percentiles. 
    quant_df = filt_df.quantile([LOWER_LIMIT, HIGHER_LIMIT]) 

    # Next filtering values based on computed percentiles. To do that I use 
    # an apply by columns and that's it ! 
    filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[LOWER_LIMIT,x.name]) & 
             (x < quant_df.loc[HIGHER_LIMIT,x.name])], axis=0) 

    filt_df = pd.concat([df.loc[:, df.columns != 'value'], filt_df], axis=1) 
    filt_df.dropna(inplace=True) 
    return filt_df

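I couldn't get these formulas to work properly for this purpose, but I include them as suggestions. Finally, I need to compute the percentage of outliers for the USA and CHINA rows shown in the data.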

How can I achieve this?

Note: setting the outlier column to all zeros is easy in pandas; it would look like this,

df['outlier'] = 0

However, I still need to find the outliers and overwrite those zeros with 1 for the respective country.

2017-03-08 Arefe

You can slice the dataframe by country, compute the quantiles for each slice, and set the value of outlier at the index of that country's rows.

There is probably a way to do it without iterating, but it is beyond me.

# using True/False for the outlier, it is the same as 1/0 
df['outlier'] = False 

# set the quantile limits 
low_q = 0.35 
high_q = 0.7 

# iterate over each country 
for c in df.country.unique(): 
    # subset the dataframe where the country = c, get the quantiles 
    q = df.value[df.country==c].quantile([low_q, high_q]) 
    # at the row index where the country column equals `c` and the column is `outlier` 
    # set the value to be true or false based on if the `value` column is within 
    # the quantiles 
    df.loc[df.index[df.country==c], 'outlier'] = (df.value[df.country==c] 
     .apply(lambda x: x<q[low_q] or x>q[high_q]))
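
The loop above can probably be avoided with groupby and transform. This is only a minimal sketch, assuming the same 0.35 and 0.70 quantile limits. The small sample frame is made up for illustration and is not the asker's data. Rows with a missing country are not handled specially here.

```python
import pandas as pd

# Hypothetical sample data with the same two columns as the question.
df = pd.DataFrame({
    "value": [550.0, 500.0, 390.0, 450.0, 420.0,
              118.65, 120.82, 86.82, 112.14, 113.59],
    "country": ["USA"] * 5 + ["CHINA"] * 5,
})

low_q, high_q = 0.35, 0.70

# Per-country quantile limits, broadcast back onto the original rows.
grouped = df.groupby("country")["value"]
lower = grouped.transform(lambda s: s.quantile(low_q))
upper = grouped.transform(lambda s: s.quantile(high_q))

# 1 if the value falls outside its own country's limits, otherwise 0.
df["outlier"] = ((df["value"] < lower) | (df["value"] > upper)).astype(int)
print(df)
```

transform returns a Series aligned with the original index, so the comparison is done row by row against each row's own country limits, with no explicit loop.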

Edit: To get the percentage of outliers for each country, you can group by the country column and aggregate with the mean.

gb = df[['country','outlier']].groupby('country').mean() 
for row in gb.itertuples(): 
    print('Percentage of outliers for {: <12}: {:.1f}%'.format(row[0], 100*row[1])) 

# output: 
# Percentage of outliers for China  : 54.0% 
# Percentage of outliers for USA   : 56.0%
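
The question also mentions a rule based on 3 standard deviations. As a rough sketch, the same per-country idea could be applied to that rule as follows. It assumes the pandas default sample standard deviation (ddof=1), and the data is again invented for illustration.

```python
import pandas as pd

# Hypothetical data: one extreme USA value among eleven identical ones.
df = pd.DataFrame({
    "value": [100.0] * 11 + [1000.0] + [50.0, 52.0, 48.0, 51.0],
    "country": ["USA"] * 12 + ["CHINA"] * 4,
})

# Per-country mean and standard deviation, aligned to the original rows.
grouped = df.groupby("country")["value"]
mean = grouped.transform("mean")
std = grouped.transform("std")

# 1 if the value is more than 3 standard deviations from its country's mean.
df["outlier"] = ((df["value"] - mean).abs() > 3 * std).astype(int)

# The mean of a 0/1 column is the fraction of outliers; scale to a percentage.
pct = df.groupby("country")["outlier"].mean() * 100
for country, value in pct.items():
    print('Percentage of outliers for {: <12}: {:.1f}%'.format(country, value))
```

With small groups this rule rarely flags anything, because a single extreme value also inflates the standard deviation it is compared against.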

2017-03-08 18:43:52 James

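Thank you very much for the answer. How do I find the "percentage of outliers" for each country? I will need it as console print output. – Arefe

Added some code for your follow-up question. Please remember to mark the question as answered. :) – James

Done, and thanks for everything. – Arefe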
