2017-07-30 94 views
1

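pandas.replace and str.replace regex conflict: order of the code

My task is to remove anything in parentheses and any digits that follow a country name, and to rename some countries.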

For example, 'Bolivia (Plurinational State of)' should become 'Bolivia', and 'Switzerland17' should become 'Switzerland'.

My original code order was:

dict1 = { 
"Republic of Korea": "South Korea", 
"United States of America": "United States", 
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom", 
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

energy['Country'] = energy['Country'].replace(dict1) 
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True) 
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True) 
energy.loc[energy['Country'] == 'United States'] 

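The str.replace part works fine, so that part of the task is done. But when I use the last line to check whether the country names were changed, this original code does not work. However, if I change the order of the code to: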

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True) 
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True) 
energy['Country'] = energy['Country'].replace(dict1)

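Then it successfully changes the country names. So there must be something wrong with my regex syntax. How can I resolve this conflict, and why does it happen?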

+1

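There does not seem to be any conflict. You first need to remove the unnecessary parts of the strings, and only then replace with the dictionary. The first order does not work because no dictionary key matches. – jezrael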

+0

Sorry, I don't understand. All I did was move the line energy['Country'] = energy['Country'].replace(dict1). I did not edit anything in the str.replace part. Why does it suddenly work? – Dylan

+0

Please check my answer. – jezrael

Answers

3

The problem is that Series.replace needs regex=True to replace substrings; by default a dictionary key only matches a whole cell value:

import pandas as pd

energy = pd.DataFrame({'Country':['United States of America4', 
            'United States of America (aaa)','Slovakia']}) 
print (energy) 
          Country 
0  United States of America4 
1 United States of America (aaa) 
2      Slovakia 

dict1 = { 
"Republic of Korea": "South Korea", 
"United States of America": "United States", 
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom", 
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

#nothing is replaced because no key matches a whole value (digits and parentheses are still there) 
energy['Country'] = energy['Country'].replace(dict1) 
print (energy) 
          Country 
0  United States of America4 
1 United States of America (aaa) 
2      Slovakia 

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True) 
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True) 
print (energy) 
        Country 
0 United States of America 
1 United States of America 
2     Slovakia 

print (energy.loc[energy['Country'] == 'United States']) 
Empty DataFrame 
Columns: [Country] 
Index: [] 

#starting again from the original sample: with regex=True the keys also match substrings 
energy['Country'] = energy['Country'].replace(dict1, regex=True) 
print (energy) 
       Country 
0  United States4 
1 United States (aaa) 
2    Slovakia 

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True) 
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True) 
print (energy) 
     Country 
0 United States 
1 United States 
2  Slovakia 

print (energy.loc[energy['Country'] == 'United States']) 
     Country 
0 United States 
1 United States 

#alternatively, starting again from the original sample: clean the data first 
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True) 
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True) 
print (energy) 
        Country 
0 United States of America 
1 United States of America 
2     Slovakia 

#now the whole-value replace works 
energy['Country'] = energy['Country'].replace(dict1) 
print (energy) 
     Country 
0 United States 
1 United States 
2  Slovakia 

print (energy.loc[energy['Country'] == 'United States']) 
     Country 
0 United States 
1 United States 
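
As a compact restatement of the clean-first approach above, here is a minimal sketch. The sample rows and the helper name `clean_country` are hypothetical, chosen only for illustration; `regex=True` is passed explicitly to `str.replace` because newer pandas versions no longer treat the pattern as a regular expression by default.

```python
import pandas as pd

# Hypothetical sample data, modelled on the examples in the question.
energy = pd.DataFrame({'Country': ['United States of America20',
                                   'Bolivia (Plurinational State of)',
                                   'Switzerland17',
                                   'Republic of Korea']})

dict1 = {"Republic of Korea": "South Korea",
         "United States of America": "United States"}


def clean_country(names, renames):
    """Strip parenthesised text and digits, then rename whole values."""
    cleaned = (names.str.replace(r'\s*\(.*\)', '', regex=True)  # drop "(...)"
                    .str.replace(r'\d+', '', regex=True)        # drop digits
                    .str.strip())
    # With a dict and the default regex=False, Series.replace only matches
    # whole values, so it has to run after the cleaning steps.
    return cleaned.replace(renames)


energy['Country'] = clean_country(energy['Country'], dict1)
print(energy['Country'].tolist())
```

Keeping the rename as the last step means the dictionary keys can stay as plain country names rather than regular expressions.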
+0

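Thank you! I thought the data had already been cleaned so that the name was 'United States of America'. After reading your answer, I used energy.loc[energy['Country'].str.contains('^United', na=False)] to check. I found that the original value was 'United States of America20', so no wonder it could not find a match. – Dylan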
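
The check described in this comment can be sketched as follows. The `raw` series is a hypothetical stand-in for the uncleaned column; the idea is to compare exact matches against the dictionary keys with a looser substring search to see which values still carry extra characters.

```python
import pandas as pd

# Hypothetical raw column, before any cleaning.
raw = pd.Series(['United States of America20', 'Slovakia', None], name='Country')

dict1 = {"United States of America": "United States"}

# Exact, whole-value matches against the dictionary keys.
exact = raw.isin(list(dict1))

# Substring match anchored at the start of the string, as in the comment;
# na=False treats missing values as non-matches.
starts = raw.str.contains('^United', na=False)

# Values that look like a key but would not be replaced by replace(dict1).
print(raw[starts & ~exact].tolist())
```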

+0

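Glad I could help! If my answer was helpful, don't forget to [accept](http://meta.stackexchange.com/a/5235/295067) it: click the check mark ('✓') next to the answer to toggle it from greyed out to filled in. Thanks. – jezrael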