2017-10-06 48 views
0

How to do keyword mapping in pandas

I have these keywords:

India 
Japan 
United States 
Germany 
China 

Here is a sample DataFrame:

id Address 
1  Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan 
2  Arcisstraße 21, 80333 München, Germany 
3  Liberty Street, Manhattan, New York, United States 
4  30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 
5  Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India 
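
For anyone who wants to reproduce the problem, the sample above can be rebuilt roughly as follows. This is only a sketch: the variable name `df` is the one the answers use, while `keywords` is a name chosen here for illustration.

```python
import pandas as pd

# Rebuild the sample table from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Address": [
        "Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan",
        "Arcisstraße 21, 80333 München, Germany",
        "Liberty Street, Manhattan, New York, United States",
        "30 Shuangqing Rd, Haidian Qu, Beijing Shi, China",
        "Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India",
    ],
})

# The keyword list from the question (`keywords` is an illustrative name).
keywords = ["India", "Japan", "United States", "Germany", "China"]

print(df)
```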

My goal is to produce:

id Address               India Japan United States Germany China  
1  Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan    0  1  0    0  0     
2  Arcisstraße 21, 80333 München, Germany       0  0  0    1  0 
3  Liberty Street, Manhattan, New York, United States   0  0  1    0  0 
4  30 Shuangqing Rd, Haidian Qu, Beijing Shi, China    0  0  0    0  1 
5  Vaishnavi Summit,80feet Road,Bangalore, Karnataka, India  1  0  0    0  0 

The basic idea is to build a keyword detector. I am considering using str.contains or word2vec, but I can't work out the logic.

Answers

3

Use pd.get_dummies():

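# Extract the first matching country name from each address (NaN if none matches)
countries = df.Address.str.extract('(India|Japan|United States|Germany|China)', expand=False)
# One-hot encode; dtype=int keeps 0/1 values where newer pandas would return booleans
dummies = pd.get_dummies(countries, dtype=int)
pd.concat([df, dummies], axis=1)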

Alternatively, the most straightforward way is to put the countries in a list and use a for loop, for example:

countries = ['India','Japan','United States','Germany','China'] 
for c in countries: 
    df[c] = df.Address.str.contains(c) * 1 

However, this may be slow if you have a lot of data and many countries.
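
One thing to watch with the loop: `str.contains` treats its pattern as a regular expression by default, so a keyword containing characters such as `.` or `(` could match more than intended. A minimal sketch of a plain-text variant, assuming a small made-up frame (the three rows below are illustrative, not the full sample):

```python
import pandas as pd

# Small illustrative frame; only the column name `Address` comes from the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "Address": [
        "Minato, Tokyo 105-0011, Japan",
        "Liberty Street, Manhattan, New York, United States",
        None,  # a missing address, to show the na= handling
    ],
})

countries = ["India", "Japan", "United States", "Germany", "China"]
for c in countries:
    # regex=False matches the keyword literally; na=False maps missing values to False
    df[c] = df["Address"].str.contains(c, regex=False, na=False).astype(int)

print(df[countries])
```

Using `.astype(int)` instead of `* 1` gives the same 0/1 result and makes the intent a little more explicit.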

+0

This has a typo and doesn't run. – piRSquared

3
In [58]: df = df.join(df.Address.str.extract(r'.*,(.*)', expand=False).str.get_dummies()) 

In [59]: df 
Out[59]: 
    id           Address China Germany India Japan United States 
0 1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, J...  0   0  0  1    0 
1 2    Arcisstraße 21, 80333 München, Germany  0   1  0  0    0 
2 3 Liberty Street, Manhattan, New York, United St...  0   0  0  0    1 
3 4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China  1   0  0  0    0 
4 5 Vaishnavi Summit,80feet Road,3rd Block,Bangalo...  0   0  1  0    0 

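Note: this method will not work if the country is not in the last position of the Address column, or if the country name itself contains a comma.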
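
A slightly more defensive variant of the same last-field idea could look like this. It is a sketch, not the answer's exact method: it swaps the regex for `str.rsplit`, trims whitespace around the last field, and keeps only the known keyword columns. The `keywords`, `last_field` and `result` names and the third row are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "Address": [
        "Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan ",  # trailing space
        "Arcisstraße 21, 80333 München, Germany",
        "Germany, an address that does not end with a country",
    ],
})
keywords = ["India", "Japan", "United States", "Germany", "China"]

# Take the text after the last comma and strip surrounding whitespace.
last_field = df["Address"].str.rsplit(",", n=1).str[-1].str.strip()

# One-hot encode, then keep exactly the keyword columns:
# unknown last fields are dropped and absent keywords are filled with 0.
dummies = pd.get_dummies(last_field, dtype=int).reindex(columns=keywords, fill_value=0)

result = df.join(dummies)
print(result[keywords])
```

The third row shows the limitation from the note: "Germany" appears in the address but not in the last field, so every keyword column stays 0 for it.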

+1

I'm on my phone and answered off the top of my head. Can you confirm that my answer works? – piRSquared

+0

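It's a ufunc version of str.find. I can broadcast it over the addresses and the keywords. It returns the position if the keyword is found and -1 otherwise, hence the >= 0. – piRSquared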

+0

Thanks. I'll fix it in a few hours when I'm back at my computer. – piRSquared

2
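import numpy as np
import pandas as pd

kw = 'India|Japan|United States|Germany|China'.split('|')
# Column vector of addresses, shape (n, 1), so it broadcasts against the keyword list
a = df.Address.values.astype(str)[:, None]

# np.char.find (the public alias of numpy.core.defchararray.find) returns the
# match position, or -1 when the keyword is absent
df.join(
    pd.DataFrame(
        (np.char.find(a, kw) >= 0).astype(int),
        df.index, kw
    )
)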

    id      Address India Japan United States Germany China 
0 1 Chome-2-8 Shibakoen, Minat...  0  1    0  0  0 
1 2 Arcisstraße 21, 80333 Münc...  0  0    0  1  0 
2 3 Liberty Street, Manhattan,...  0  0    1  0  0 
3 4 30 Shuangqing Rd, Haidian ...  0  0    0  0  1 
4 5 Vaishnavi Summit,80feet Ro...  1  0    0  0  0
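
To see what the broadcast `find` call returns before it is turned into 0/1 columns, here is a small standalone sketch. The two addresses and the names `addresses`, `positions` and `mask` are made up for illustration.

```python
import numpy as np

addresses = np.array(["Minato, Tokyo, Japan", "Munich, Germany"])
kw = ["Japan", "Germany"]

# addresses[:, None] has shape (2, 1); broadcasting it against the two keywords
# gives one row per address and one column per keyword.
positions = np.char.find(addresses[:, None], kw)

# A position of -1 means "not found", so >= 0 marks the hits.
mask = positions >= 0

print(positions)
print(mask.astype(int))
```

Each row of `positions` holds the start index of every keyword in that address, and the comparison with 0 is what turns it into the indicator matrix joined onto `df` above.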