Pandas: trying to use a regex with the apply method on a column

So, I have a DataFrame with data about airplane crashes.

In []: df = pd.read_csv('Airplane_Crashes_and_Fatalities_Since_1908.csv') 
In []: df.info() 
In []: df.head() 

Out []: 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 5268 entries, 0 to 5267 
Data columns (total 13 columns): 
Date   5268 non-null object 
Time   3049 non-null object 
Location  5248 non-null object 
Operator  5250 non-null object 
Flight #  1069 non-null object 
Route   3562 non-null object 
Type   5241 non-null object 
Registration 4933 non-null object 
cn/In   4040 non-null object 
Aboard   5246 non-null float64 
Fatalities  5256 non-null float64 
Ground   5246 non-null float64 
Summary   4878 non-null object 
dtypes: float64(3), object(10) 
memory usage: 535.1+ KB 
Out []: 
     Date Time       Location \ 
0 09/17/1908 17:18     Fort Myer, Virginia 
1 07/12/1912 06:30    AtlantiCity, New Jersey 
2 08/06/1913 NaN Victoria, British Columbia, Canada 
3 09/09/1913 18:30     Over the North Sea 
4 10/17/1913 10:30   Near Johannisthal, Germany 

      Operator  Flight #   Route     Type \ 
0 Military - U.S. Army  NaN Demonstration  Wright Flyer III 
1 Military - U.S. Navy  NaN Test flight    Dirigible 
2     Private  -   NaN  Curtiss seaplane 
3 Military - German Navy  NaN   NaN Zeppelin L-1 (airship) 
4 Military - German Navy  NaN   NaN Zeppelin L-2 (airship) 

    Registration cn/In  Aboard Fatalities Ground \ 
0   NaN  1  2.0   1.0  0.0 
1   NaN NaN  5.0   5.0  0.0 
2   NaN NaN  1.0   1.0  0.0 
3   NaN NaN 20.0  14.0  0.0 
4   NaN NaN 30.0  30.0  0.0 

             Summary 
0 During a demonstration flight, a U.S. Army fly... 
1 First U.S. dirigible Akron exploded just offsh... 
2 The first fatal airplane accident in Canada oc... 
3 The airship flew into a thunderstorm and encou... 
4 Hydrogen gas which was being vented was sucked...

I want to categorize the 'Operator' column and create a new column containing the plane type. I tried to do this with a regex and .apply():

import re

def plane_type(plane):
    # Raw strings keep the backslash in \w from being treated as a string escape
    m = re.search(r'\w*Military', plane)
    p = re.search(r'\w*Private', plane)
    if m:
        return 'Military'
    elif p:
        return 'Private'
    else:
        return 'Passengers'

df['plane_type'] = df['operator'].apply(plane_type)

I also tried it with a lambda:

df['plane_type'] = df['operator'].apply(lambda x: plane_type(x))

In the end, I get a TypeError every time:

TypeError: expected string or buffer

Could someone please tell me what I am missing?

Source

2016-12-13 Konstantin Kim

Try: `df['plane_type'] = df['operator'].astype(str).apply(lambda x: plane_type(x))`. – Abdou

Also, two more things: your column name is 'Operator', but you seem to be indexing 'operator'. And are you sure those are the regex patterns you want to use? – Abdou

@Abdou, yes, I changed the column names to lowercase and forgot to mention it. Thanks, it works now :) (I had just mixed up the order of astype(str)) –
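For anyone wondering where the TypeError comes from: df.info() above shows 5250 non-null values in Operator out of 5268 rows, so some entries are NaN, which is a float, and re.search only accepts strings. Below is a minimal sketch of that failure on a made-up three-row Series (the sample values are an assumption, not rows from the real file). Instead of the astype(str) cast from the comment, it fills the gaps with empty strings via fillna(''), which is my substitution and not something from this thread.

```python
import re

import numpy as np
import pandas as pd


def plane_type(plane):
    # Same classification logic as in the question.
    if re.search(r'\w*Military', plane):
        return 'Military'
    elif re.search(r'\w*Private', plane):
        return 'Private'
    return 'Passengers'


# Hypothetical miniature of the Operator column with one missing value.
operators = pd.Series(['Military - U.S. Army', 'Private', np.nan])

try:
    operators.apply(plane_type)
    raised = False
except TypeError:
    # re.search was handed the float NaN instead of a string.
    raised = True

# Replace missing values with an empty string before applying the function;
# an empty string matches neither pattern, so it falls through to 'Passengers'.
fixed = operators.fillna('').apply(plane_type)

print(raised)
print(fixed.tolist())
```

Whether missing operators should really be labelled 'Passengers' is a separate decision; the sketch only shows why the call fails and one way to avoid it.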

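I think you can use extract first, which leaves missing values where nothing matches (^ restricts the match to the start of the string), and then fillna: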

df['plane_type'] = df.Operator.str.extract('(^Military|^Private)', expand=False) 
df['plane_type'] = df['plane_type'].fillna('Passengers')

Sample:

df = pd.DataFrame({'Operator':['Military - U.S. Navy','Private', 
           'Another Military - German', 'Other']}) 
print (df) 
        Operator 
0  Military - U.S. Navy 
1     Private 
2 Another Military - German 
3      Other 

df['plane_type'] = df.Operator.str.extract('(^Military|^Private)', expand=False) 
df['plane_type'] = df['plane_type'].fillna('Passengers') 
print (df) 
        Operator plane_type 
0  Military - U.S. Navy Military 
1     Private  Private 
2 Another Military - German Passengers 
3      Other Passengers

Also, if you need to extract the keywords from anywhere in the string, omit ^:

df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False) 
df['plane_type'] = df['plane_type'].fillna('Passengers') 
print (df) 
        Operator plane_type 
0  Military - U.S. Navy Military 
1     Private  Private 
2 Another Military - German Military 
3      Other Passengers
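The sample above has no missing operators, while the real column does. As a small sketch (the NaN row is my own addition to the sample), this shows that the .str accessor passes missing values through as NaN, so no astype(str) is needed before extract, and fillna then labels both the missing and the non-matching rows:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing operator, as in the real data set.
df = pd.DataFrame({'Operator': ['Military - U.S. Navy', 'Private', np.nan, 'Other']})

extracted = df['Operator'].str.extract('(Military|Private)', expand=False)
# At this point rows that are missing or match neither keyword are NaN.
df['plane_type'] = extracted.fillna('Passengers')

print(df)
```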


Timings:

apply is slower:

#400k rows 
In [80]: %timeit df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers') 
1 loop, best of 3: 711 ms per loop 

In [81]: %timeit df['plane_type1'] = df['Operator'].astype(str).apply(plane_type) 
1 loop, best of 3: 1.69 s per loop

#6k rows 
In [84]: %timeit df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers') 
100 loops, best of 3: 10.8 ms per loop 

In [85]: %timeit df['plane_type1'] = df['Operator'].astype(str).apply(plane_type) 
10 loops, best of 3: 25.8 ms per loop

Code for timings:

import re
import pandas as pd

df = pd.DataFrame({'Operator':['Military - U.S. Navy','Private','Another Military - German', 'Other']}) 
df = pd.concat([df]*100000).reset_index(drop=True) 
#[400000 rows x 1 columns] 
#print (df) 


df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers') 
#print (df) 

def plane_type(plane):
    m = re.search(r'\w*Military', plane)
    p = re.search(r'\w*Private', plane)
    if m:
        return 'Military'
    elif p:
        return 'Private'
    else:
        return 'Passengers'

df['plane_type1'] = df['Operator'].astype(str).apply(plane_type) 
print (df)
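One caveat when swapping one approach for the other: they agree on the sample data, but they are not identical in general. extract returns the first keyword found in the string, while plane_type checks for Military before Private. The sketch below checks both points; the operator name containing both keywords is invented for illustration.

```python
import re

import pandas as pd


def plane_type(plane):
    # Same logic as above: Military takes priority over Private.
    if re.search(r'\w*Military', plane):
        return 'Military'
    elif re.search(r'\w*Private', plane):
        return 'Private'
    return 'Passengers'


def by_extract(series):
    # Vectorized variant from this answer.
    return series.str.extract('(Military|Private)', expand=False).fillna('Passengers')


sample = pd.Series(['Military - U.S. Navy', 'Private',
                    'Another Military - German', 'Other'])
same_on_sample = bool((by_extract(sample) == sample.apply(plane_type)).all())

# Hypothetical operator that mentions both keywords.
mixed = pd.Series(['Private charter for Military'])
mixed_extract = by_extract(mixed).iloc[0]    # first keyword in the string
mixed_apply = plane_type(mixed.iloc[0])      # Military checked first

print(same_on_sample, mixed_extract, mixed_apply)
```

If such mixed names occur in the data, it is worth deciding which label should win before choosing between the two versions.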

Source

2016-12-13 19:48:05 jezrael

**Thank you** for the explanation! I had never used .extract() before - I will use it from now on :) –
