Conflict between pandas replace and str.replace with regex: order of the code

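My task is to remove anything in parentheses, remove any digits that follow a country name, and rename some of the countries.

For example, 'Bolivia (Plurinational State of)' should become 'Bolivia', and 'Switzerland17' should become 'Switzerland'.

My original code, in its original order, was: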

dict1 = { 
"Republic of Korea": "South Korea", 
"United States of America": "United States", 
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom", 
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

energy['Country'] = energy['Country'].replace(dict1) 
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy.loc[energy['Country'] == 'United States']

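The str.replace part works fine, so that part of the task is done. I use the last line to check whether the country names were changed successfully, and with this original order they were not. However, if I change the order of the code to: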

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)

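then the country names are changed successfully. So I assumed something must be wrong with my regex syntax. How can I resolve this conflict, and why does it happen?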

2017-07-30 Dylan

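There doesn't seem to be any conflict. You need to remove the unwanted parts of the strings first and then replace with the dictionary. The first version doesn't work because no value matches a dictionary key. – jezrael

Sorry, I don't understand. All I did was move the energy['Country'] = energy['Country'].replace(dict1) line. I didn't edit anything in the str.replace part. Why does it suddenly work? – Dylan

Please check my answer. – jezrael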

The problem is that replace with a dictionary only matches whole values; to replace substrings you need regex=True in replace:

import pandas as pd

energy = pd.DataFrame({'Country': ['United States of America4',
                                   'United States of America (aaa)',
                                   'Slovakia']})
print(energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

dict1 = { 
"Republic of Korea": "South Korea", 
"United States of America": "United States", 
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom", 
"China, Hong Kong Special Administrative Region": "Hong Kong"}

# no replacement, because no value matches a key exactly (digits and parentheses remain)
energy['Country'] = energy['Country'].replace(dict1)
print(energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

# cleaning afterwards leaves the long names, so the rename never happened
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print(energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

print (energy.loc[energy['Country'] == 'United States']) 
Empty DataFrame 
Columns: [Country] 
Index: []

# option 1 (starting again from the original DataFrame): regex=True replaces substrings
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print(energy)
               Country
0       United States4
1  United States (aaa)
2             Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print(energy)
         Country
0  United States
1  United States
2       Slovakia

print(energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States

# option 2 (again starting from the original DataFrame): clean the data first
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print(energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

# now the plain dictionary replace works, because whole values match the keys
energy['Country'] = energy['Country'].replace(dict1)
print(energy)
         Country
0  United States
1  United States
2       Slovakia

print(energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States
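
The clean-first order can also be wrapped in one small function. The following is a minimal sketch and not part of the original answer: the helper name clean_country and the sample rows are assumptions made for illustration, and regex=True is passed explicitly because newer pandas versions no longer treat the str.replace pattern as a regex by default.

```python
import pandas as pd


def clean_country(names: pd.Series, renames: dict) -> pd.Series:
    """Hypothetical helper: strip noise first, then rename whole values."""
    cleaned = (
        names
        .str.replace(r"\s*\(.*\)", "", regex=True)  # drop parenthesised text
        .str.replace(r"\d+", "", regex=True)        # drop digits such as footnote markers
        .str.strip()                                # remove leftover outer whitespace
    )
    # Plain dictionary replace matches whole values only,
    # which is safe now that the values are clean.
    return cleaned.replace(renames)


# Illustrative data, not the asker's real dataset.
energy = pd.DataFrame({"Country": ["United States of America4",
                                   "Bolivia (Plurinational State of)",
                                   "Switzerland17",
                                   "Republic of Korea"]})
dict1 = {"Republic of Korea": "South Korea",
         "United States of America": "United States"}

energy["Country"] = clean_country(energy["Country"], dict1)
print(energy["Country"].tolist())
```

Keeping the dictionary step last means the keys can stay as plain country names and never have to be treated as regular expressions.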

2017-07-30 08:26:39 jezrael

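Thanks! I assumed the data already contained the clean name 'United States of America'. After reading your answer, I used energy.loc[energy['Country'].str.contains('^United', na=False)] to check, and found that the raw value was 'United States of America20'. No wonder it couldn't find a match. – Dylan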
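
The kind of check described in this comment can be made a little more systematic. This is only a sketch under assumed data: the sample rows and the names unmatched and dirty are hypothetical, chosen to show how to spot dictionary keys that would be silently skipped.

```python
import pandas as pd

# Illustrative data, including one missing value.
energy = pd.DataFrame({"Country": ["United States of America20",
                                   "Republic of Korea",
                                   "Slovakia",
                                   None]})
dict1 = {"Republic of Korea": "South Korea",
         "United States of America": "United States"}

# Keys that equal no whole value; replace(dict1) would skip these without any warning.
unmatched = [key for key in dict1 if not (energy["Country"] == key).any()]
print(unmatched)

# Raw values that still carry digits or an opening parenthesis.
has_noise = energy["Country"].str.contains(r"\d|\(", na=False)
dirty = energy.loc[has_noise, "Country"]
print(dirty.tolist())
```

A non-empty unmatched list before the rename step is a hint that the cleaning has not run yet, or that the raw names differ from the dictionary keys.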

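Glad I could help! If my answer was helpful, don't forget to [accept](http://meta.stackexchange.com/a/5235/295067) it: click the check mark ('✓') next to the answer to toggle it from greyed out to filled in. Thanks. – jezrael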
