2016-04-23 133 views
2

我有以下兩個數據幀(可以發現herehere):熊貓:dataframes不會合並

df= pd.read_csv('Thesis/ExternalData/naics_conversion_data/SIC2CRPCats.csv', \ 
       engine='python', sep=r'\s{2,}', encoding='utf-8_sig') 

我只提供的代碼爲df閱讀,因爲它有一些獨特的格式問題。

df.dtypes 

SICcode  object 
Catcode  object 
Category object 
SICname  object 
MultSIC  object 
dtype: object 

merged.dtypes 

2012 NAICS Code  float64 
2002to2007 NAICS float64 
SICcode    object 
dtype: object 

df.columns.tolist() 
['SICcode', 'Catcode', 'Category', 'SICname', 'MultSIC'] 

merged.columns.tolist() 
['2012 NAICS Code', '2002to2007 NAICS', 'SICcode'] 

df.head(3) 

    SICcode  Catcode  Category       SICname MultSIC 
0 111   A1500 Wheat, corn, soybeans and cash grain Wheat X 
1 112   A1600 Other commodities (incl rice, peanuts) Rice X 
2 115   A1500 Wheat, corn, soybeans and cash grain Corn X 

merged.sort_values('SICcode') 

    2012 NAICS Code  2002to2007 NAICS SICcode 
89 212210      212210  1011 
93 212234      212234  1021 
92 212231      212231  1031 
90 212221      212221  1041 
91 212222      212222  1044 
96 212299      212299  1061 
94 212234      212234  1061 
119 213114      213114  1081 
1770 541360     541360  1081 
233  238910     238910  1081 
95 212291      212291  1094 
97 212299      212299  1099 
3 111140      111140  111 
6 111160      111160  112 
4 111150      111150  115 
0 111110      111110  116 

我想他們這個代碼合併到一起:merged=pd.merge(merged,df, how='right', on='SICcode')

導致此:

2012 NAICS Code  0 
2002to2007 NAICS  0 
SICcode    1007 
Catcode    991 
Category   1007 
SICname    1007 
MultSIC    906 
dtype: int64 

我懷疑問題在於的df格式,但我不知道如何描述(我聽說過white space這個詞,可能與這種情況有關)或者解決這個問題。有沒有人有這個想法?

回答

2

我相信這是你的問題的原因:

In [47]: merged[merged.SICcode == 'Aux'] 
Out[47]: 
     2012 NAICS Code 2002to2007 NAICS SICcode 
1828   551114.0   551114.0  Aux 

導致不同的數據類型:

In [61]: df.dtypes 
Out[61]: 
SICcode  int64 
Catcode  object 
Category object 
SICname  object 
MultSIC  object 
dtype: object 

In [62]: merged.dtypes 
Out[62]: 
2012 NAICS Code  float64 
2002to2007 NAICS float64 
SICcode    object 
dtype: object 

In [63]: df.SICcode.unique() 
Out[63]: array([ 111, 112, 115, ..., 9711, 9721, 9999], dtype=int64) 

In [64]: merged.SICcode.head(10).unique() 
Out[64]: array(['116', '119', '111', '115', '112', '139'], dtype=object) 

所以,你可以這樣來做:

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/SIC2CRPCats.csv' 
df = pd.read_csv(url, engine='python', sep=r'\s{2,}', encoding='utf-8_sig') 

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/test.merge' 
merged = pd.read_csv(url, index_col=0) 

# clearing data 
merged.SICcode = pd.to_numeric(merged.SICcode, errors='coerce') 

mrg = df.merge(merged, on='SICcode', how='left') 

mrg.head() 

輸出:

In [51]: mrg.head() 
Out[51]: 
    SICcode Catcode          Category \ 
0  111 A1500   Wheat, corn, soybeans and cash grain 
1  112 A1600 Other commodities (incl rice, peanuts, honey) 
2  115 A1500   Wheat, corn, soybeans and cash grain 
3  116 A1500   Wheat, corn, soybeans and cash grain 
4  119 A1500   Wheat, corn, soybeans and cash grain 

      SICname MultSIC 2012 NAICS Code 2002to2007 NAICS 
0    Wheat  X   111140.0   111140.0 
1    Rice  X   111160.0   111160.0 
2    Corn  X   111150.0   111150.0 
3   Soybeans  X   111110.0   111110.0 
4 Cash grains, NEC  X   111120.0   111120.0 
+0

謝謝MaxU! –

+1

@MichaelPeddue,總是樂於幫助:) – MaxU