str.extract starting from the end of the string in a pandas DataFrame

I have a DataFrame with thousands of rows and two columns that starts like this:

          string  state 
0  the best new york cheesecake rochester ny   ny 
1  the best dallas bbq houston tx random str   tx 
2 la jolla fish shop of san diego san diego ca   ca 
3         nothing here   dc

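For each state, I have a regular expression of all its city names (lowercase), structured like (city1|city2|city3|...), where the order of the cities is arbitrary (but can be changed if needed). For example, the regex for New York state contains 'new york' and 'rochester' (likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).

I want to find the last city that appears in each string (for observations 1, 2, 3, 4, I would want 'rochester', 'houston', 'san diego', and NaN (or anything else), respectively).

I started with str.extract and tried things like reversing the string, but I have hit a dead end.

Many thanks for your help!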

Source

2017-09-04 user49007

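You can use str.findall, but it returns an empty list when there is no match, so an apply is needed to substitute a placeholder. Finally, select the last item of each list with .str[-1]: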

cities = r"new york|dallas|rochester|houston|san diego" 

print (df['string'].str.findall(cities) 
        .apply(lambda x: x if len(x) >= 1 else ['no match val']) 
        .str[-1]) 
0  rochester 
1   houston 
2  san diego 
3 no match val 
Name: string, dtype: object

(Corrected: the condition is >= 1, not > 1.)

Another solution is a bit of a hack: prepend the no-match string to every string with radd, and add that same string to the cities pattern as well:

a = 'no match val' 
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a 

print (df['string'].radd(a).str.findall(cities).str[-1]) 
0  rochester 
1   houston 
2  san diego 
3 no match val 
Name: string, dtype: object
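
Since the question started from str.extract, here is a minimal sketch of a further alternative that is not part of the original answer: putting a greedy .* in front of the capture group makes the regex engine take the last match rather than the first. The sample data is copied from the question, the variable name last_city is made up for illustration, and the sketch assumes single-line strings.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

cities = r"new york|dallas|rochester|houston|san diego"

# The greedy ".*" consumes as much of the string as it can before the
# capture group, so the captured city is the last one in the string.
# Assumption: single-line strings ("." does not match newlines by default).
last_city = df["string"].str.extract(r".*(" + cities + r")", expand=False)
print(last_city)
```

Rows without any match come back as NaN, so no placeholder string is needed.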

Source

2017-09-04 06:35:42 jezrael

The first solution is good enough; thanks! – user49007

@user49007 - Thanks for the correction. – jezrael

import re

cities = r"new york|dallas|..." 

def last_match(s): 
    found = re.findall(cities, s) 
    return found[-1] if found else "" 

df['string'].apply(last_match) 
#0 rochester 
#1  houston 
#2 san diego 
#3
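
The question keeps one pattern per state, while the snippets above use a single combined pattern. A minimal sketch of a per-state variant of last_match follows. The names state_cities, state_patterns and last_city are hypothetical, and the city lists are only the examples mentioned in the question.

```python
import re

# Hypothetical mapping from state code to that state's city names
# (only the examples given in the question).
state_cities = {
    "ny": ["new york", "rochester"],
    "tx": ["dallas", "houston"],
    "ca": ["san diego", "la jolla"],
}

# One compiled alternation per state; re.escape guards against city
# names that contain regex metacharacters such as ".".
state_patterns = {
    state: re.compile("|".join(re.escape(name) for name in names))
    for state, names in state_cities.items()
}

def last_city(s, state):
    """Return the last city of `state` found in `s`, or None if there is none."""
    pattern = state_patterns.get(state)
    if pattern is None:
        return None  # no pattern known for this state
    found = pattern.findall(s)
    return found[-1] if found else None
```

With the DataFrame from the question, this could be applied row by row, for example with df.apply(lambda row: last_city(row['string'], row['state']), axis=1).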

Source

2017-09-04 06:08:44 DyZ
