2016-12-13 38 views

Pandas: trying to use the apply method with a regex on a column

I have a DataFrame of airplane crash data.

In []: df = pd.read_csv('Airplane_Crashes_and_Fatalities_Since_1908.csv') 
In []: df.info() 
In []: df.head() 

Out []: 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 5268 entries, 0 to 5267 
Data columns (total 13 columns): 
Date   5268 non-null object 
Time   3049 non-null object 
Location  5248 non-null object 
Operator  5250 non-null object 
Flight #  1069 non-null object 
Route   3562 non-null object 
Type   5241 non-null object 
Registration 4933 non-null object 
cn/In   4040 non-null object 
Aboard   5246 non-null float64 
Fatalities  5256 non-null float64 
Ground   5246 non-null float64 
Summary   4878 non-null object 
dtypes: float64(3), object(10) 
memory usage: 535.1+ KB 
Out []: 
     Date Time       Location \ 
0 09/17/1908 17:18     Fort Myer, Virginia 
1 07/12/1912 06:30    AtlantiCity, New Jersey 
2 08/06/1913 NaN Victoria, British Columbia, Canada 
3 09/09/1913 18:30     Over the North Sea 
4 10/17/1913 10:30   Near Johannisthal, Germany 

      Operator  Flight #   Route     Type \ 
0 Military - U.S. Army  NaN Demonstration  Wright Flyer III 
1 Military - U.S. Navy  NaN Test flight    Dirigible 
2     Private  -   NaN  Curtiss seaplane 
3 Military - German Navy  NaN   NaN Zeppelin L-1 (airship) 
4 Military - German Navy  NaN   NaN Zeppelin L-2 (airship) 

    Registration cn/In  Aboard Fatalities Ground \ 
0   NaN  1  2.0   1.0  0.0 
1   NaN NaN  5.0   5.0  0.0 
2   NaN NaN  1.0   1.0  0.0 
3   NaN NaN 20.0  14.0  0.0 
4   NaN NaN 30.0  30.0  0.0 

             Summary 
0 During a demonstration flight, a U.S. Army fly... 
1 First U.S. dirigible Akron exploded just offsh... 
2 The first fatal airplane accident in Canada oc... 
3 The airship flew into a thunderstorm and encou... 
4 Hydrogen gas which was being vented was sucked...  

I want to categorize the 'Operator' column and create a new column containing the plane type. I tried using a regex with .apply():

import re

def plane_type(plane):
    # Raw strings keep \w from being read as a string escape
    m = re.search(r'\w*Military', plane)
    p = re.search(r'\w*Private', plane)
    if m:
        return 'Military'
    elif p:
        return 'Private'
    else:
        return 'Passengers'

df['plane_type'] = df['operator'].apply(plane_type) 

I also tried it with a lambda:

df['plane_type'] = df['operator'].apply(lambda x: plane_type(x)) 

In the end, I get a TypeError every time:

TypeError: expected string or buffer 

Can someone please tell me what I am missing?


Try: df['plane_type'] = df['operator'].astype(str).apply(lambda x: plane_type(x)). – Abdou


Also, two more things: your column name is 'Operator', but you seem to be indexing 'operator'. And are you sure those are the regex patterns you want to use? – Abdou


@Abdou, yes, I changed the column names to lowercase and forgot to mention it. Thanks, it works now :) (I had just mixed up the order of astype(str)) –
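
For context on the error above: the Operator column has missing values (5250 non-null out of 5268 rows), and pandas hands each missing value to the function as a float NaN, which re.search rejects. The following is a minimal sketch of that failure and of one possible guard. It uses a small made-up Series rather than the real CSV, and the isinstance check is an assumption of this sketch, offered as an alternative to the astype(str) fix suggested in the comment:

```python
import re

import numpy as np
import pandas as pd

# re.search only accepts str/bytes, so a float NaN raises TypeError
try:
    re.search(r'Military', float('nan'))
    raised = False
except TypeError:
    raised = True


def plane_type_safe(plane):
    # Hypothetical variant of plane_type: treat non-strings (NaN) as unmatched
    if not isinstance(plane, str):
        return 'Passengers'
    if re.search(r'Military', plane):
        return 'Military'
    if re.search(r'Private', plane):
        return 'Private'
    return 'Passengers'


# Made-up sample with one missing operator
operators = pd.Series(['Military - U.S. Army', np.nan, 'Private'])
result = operators.apply(plane_type_safe)
print(result)
```

Whether a missing operator should really be labelled 'Passengers' is a judgment call; returning NaN from the guard instead would keep those rows distinguishable.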

Answer


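I think you can first use str.extract (the ^ means only values at the start of the string are extracted; rows that do not match become NaN) and then fill those missing values with fillna: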

df['plane_type'] = df.Operator.str.extract('(^Military|^Private)', expand=False) 
df['plane_type'] = df['plane_type'].fillna('Passengers') 

Sample:

df = pd.DataFrame({'Operator':['Military - U.S. Navy','Private', 
           'Another Military - German', 'Other']}) 
print (df) 
        Operator 
0  Military - U.S. Navy 
1     Private 
2 Another Military - German 
3      Other 

df['plane_type'] = df.Operator.str.extract('(^Military|^Private)', expand=False) 
df['plane_type'] = df['plane_type'].fillna('Passengers') 
print (df) 
        Operator plane_type 
0  Military - U.S. Navy Military 
1     Private  Private 
2 Another Military - German Passengers 
3      Other Passengers 

Also, if you need to extract the keyword wherever it appears in the string, omit the ^:

df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False) 
df['plane_type'] = df['plane_type'].fillna('Passengers') 
print (df) 
        Operator plane_type 
0  Military - U.S. Navy Military 
1     Private  Private 
2 Another Military - German Military 
3      Other Passengers 
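
To see what the intermediate step produces, here is a small sketch that keeps the result of str.extract before calling fillna. The sample frame is made up for illustration and includes one missing Operator value, since the real data has some. Rows that do not match and rows that are missing both come back as NaN, so no TypeError is raised:

```python
import numpy as np
import pandas as pd

# Made-up sample; the second row mimics a missing operator
df = pd.DataFrame({'Operator': ['Military - U.S. Navy', np.nan,
                                'Another Military - German', 'Other']})

# Step 1: extract the keyword; unmatched and missing rows become NaN
extracted = df['Operator'].str.extract('(Military|Private)', expand=False)
print(extracted.isna().tolist())

# Step 2: replace every NaN with the default category
df['plane_type'] = extracted.fillna('Passengers')
print(df)
```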

Timings

apply is slower:

#400k rows 
In [80]: %timeit df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers') 
1 loop, best of 3: 711 ms per loop 

In [81]: %timeit df['plane_type1'] = df['Operator'].astype(str).apply(plane_type) 
1 loop, best of 3: 1.69 s per loop 

#6k rows 
In [84]: %timeit df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers') 
100 loops, best of 3: 10.8 ms per loop 

In [85]: %timeit df['plane_type1'] = df['Operator'].astype(str).apply(plane_type) 
10 loops, best of 3: 25.8 ms per loop 

Code for the timings:

import re
import pandas as pd

df = pd.DataFrame({'Operator':['Military - U.S. Navy','Private','Another Military - German', 'Other']})
df = pd.concat([df]*100000).reset_index(drop=True) 
#[400000 rows x 1 columns] 
#print (df) 


df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers') 
#print (df) 

def plane_type(plane):
    # Raw strings keep \w from being read as a string escape
    m = re.search(r'\w*Military', plane)
    p = re.search(r'\w*Private', plane)
    if m:
        return 'Military'
    elif p:
        return 'Private'
    else:
        return 'Passengers'

df['plane_type1'] = df['Operator'].astype(str).apply(plane_type) 
print (df) 
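
%timeit is only available in IPython/Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The frame is smaller here (4,000 rows) and the helper names with_extract and with_apply are invented for this example; absolute numbers will vary by machine and pandas version:

```python
import re
import timeit

import pandas as pd


def plane_type(plane):
    if re.search(r'Military', plane):
        return 'Military'
    elif re.search(r'Private', plane):
        return 'Private'
    return 'Passengers'


df = pd.DataFrame({'Operator': ['Military - U.S. Navy', 'Private',
                                'Another Military - German', 'Other']})
df = pd.concat([df] * 1000).reset_index(drop=True)  # 4,000 rows


def with_extract():
    # Vectorized string method followed by fillna
    return df['Operator'].str.extract('(Military|Private)', expand=False).fillna('Passengers')


def with_apply():
    # Python-level function called once per row
    return df['Operator'].astype(str).apply(plane_type)


# Best of 3 repeats, 10 calls each
t_extract = min(timeit.repeat(with_extract, number=10, repeat=3))
t_apply = min(timeit.repeat(with_apply, number=10, repeat=3))
print('extract: %.4f s, apply: %.4f s (10 calls each)' % (t_extract, t_apply))
```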

**Thanks** for the explanation! I had never used .extract() before, and I will use it from now on :) –