2017-05-25

Parsing datetimes in a nested pandas DataFrame column

I am struggling to parse datetimes in pandas. Here is a simple example:

df.iloc[:10,10:] 
Out[45]: 
           response_date   revision scheduleClosedAt scheduleEventIndex scheduleId scheduleOpenedAt 
0 {u'$date': u'2012-01-10T11:00:00.000+0000'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
1 {u'$date': u'2012-01-19T13:00:00.000+0000'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
2 {u'$date': u'2011-06-15T09:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
3 {u'$date': u'2011-06-22T00:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
4 {u'$date': u'2011-06-30T09:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
5 {u'$date': u'2011-07-05T00:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
6 {u'$date': u'2011-07-14T10:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
7 {u'$date': u'2011-07-20T09:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
8 {u'$date': u'2011-07-26T00:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
9 {u'$date': u'2011-08-02T00:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 

I need to get rid of the nesting in the "response_date" column and convert it to a normal datetime, while keeping the column name "response_date".

I tried:

>> df_respons = df.response_date.apply(pd.Series) 
>> df_new_response = pd.to_datetime(df_respons) 

but got this error:

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing 
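
For context, here is a minimal sketch that reproduces the error, assuming a two-row frame built from values shown in the table above. When `pd.to_datetime` receives a whole DataFrame, it tries to assemble dates from columns named year, month and day. The expanded frame only has a `$date` column, so the call fails. Passing that single column as a Series parses the strings instead. The `utc=True` argument is an addition here, used so that the mixed `+0000`/`+0100` offsets end up in one timezone-aware column.

```python
import pandas as pd

# Two rows copied from the question's sample data.
df = pd.DataFrame({
    "response_date": [
        {"$date": "2012-01-10T11:00:00.000+0000"},
        {"$date": "2011-06-15T09:00:00.000+0100"},
    ]
})

# Expanding the dicts yields a DataFrame with a single '$date' column.
expanded = df["response_date"].apply(pd.Series)

# A DataFrame argument is treated as year/month/day components,
# so this call raises ValueError.
try:
    pd.to_datetime(expanded)
    raised = False
except ValueError:
    raised = True

# Selecting the '$date' column gives a Series of strings, which parses fine.
parsed = pd.to_datetime(expanded["$date"], utc=True)
print(raised)
print(parsed)
```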

Is there a neat way to turn the nested datetimes into a clean column?

Edit

How can I ignore missing values?

43025 {u'$date': u'2015-11-18T10:35:00.000+0000'} 
43026 {u'$date': u'2015-11-18T14:23:00.000+0000'} 
43027 {u'$date': u'2015-11-18T14:23:00.000+0000'} 
43028 {u'$date': u'2015-11-18T15:20:00.000+0000'} 
43029 {u'$date': u'2015-11-18T15:20:00.000+0000'} 
43030           NaN 
43031           NaN 
43032 {u'$date': u'2015-11-19T08:00:00.000+0000'} 
43033 {u'$date': u'2015-11-19T08:00:00.000+0000'} 
43034 {u'$date': u'2015-11-24T08:00:00.000+0000'} 

This produces a new '0' column:

 0     response_date 
43027 NaN 2015-11-18T14:23:00.000+0000 
43028 NaN 2015-11-18T15:20:00.000+0000 
43029 NaN 2015-11-18T15:20:00.000+0000 
43030 NaN       NaN 
43031 NaN       NaN 
43032 NaN 2015-11-19T08:00:00.000+0000 
43033 NaN 2015-11-19T08:00:00.000+0000 
43034 NaN 2015-11-24T08:00:00.000+0000 

Answers

1

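You can use combine_first or fillna to replace each NaN with an empty dict, then pass the result to the DataFrame constructor, using values to convert to a numpy array and then tolist: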

d = {'$date':'response_date'} 
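# one empty dict per row; a single-element list may not broadcast to the index in newer pandas
s = pd.Series([{} for _ in range(len(df))], index=df.index)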
df = pd.DataFrame(df['0'].combine_first(s).values.tolist()).rename(columns=d) 
#alternatively 
#df = pd.DataFrame(df['0'].fillna(s).values.tolist()).rename(columns=d) 
df['response_date'] = pd.to_datetime(df['response_date']) 
print (df) 
     response_date 
0 2015-11-18 10:35:00 
1 2015-11-18 14:23:00 
2 2015-11-18 14:23:00 
3 2015-11-18 15:20:00 
4 2015-11-18 15:20:00 
5     NaT 
6     NaT 
7 2015-11-19 08:00:00 
8 2015-11-19 08:00:00 
9 2015-11-24 08:00:00 
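
The snippet above resets the index to 0..9. Below is a self-contained sketch of the same combine_first idea that keeps the original index, assuming the dicts live in a column named `response_date` (as in the question) rather than `'0'`. The three sample rows are taken from the edited question, and `utc=True` is an added assumption so the result is timezone-aware.

```python
import numpy as np
import pandas as pd

# Three rows from the edited question, including one missing value.
df = pd.DataFrame(
    {"response_date": [
        {"$date": "2015-11-18T15:20:00.000+0000"},
        np.nan,
        {"$date": "2015-11-19T08:00:00.000+0000"},
    ]},
    index=[43029, 43030, 43032],
)

# One empty dict per row, aligned on the original index.
empty = pd.Series([{} for _ in range(len(df))], index=df.index)

# NaN entries are replaced by {}, dict entries are kept.
filled = df["response_date"].combine_first(empty)

# Expand the dicts; the empty dicts become NaN in the '$date' column.
expanded = pd.DataFrame(filled.tolist(), index=df.index)

# Parse the strings; NaN becomes NaT.
df["response_date"] = pd.to_datetime(expanded["$date"], utc=True)
print(df)
```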

Another solution, with map:

df['response_date'] = \ 
pd.to_datetime(df['response_date'].map(lambda x: x['$date'] if type(x) == dict else x)) 
print (df) 
      response_date 
43025 2015-11-18 10:35:00 
43026 2015-11-18 14:23:00 
43027 2015-11-18 14:23:00 
43028 2015-11-18 15:20:00 
43029 2015-11-18 15:20:00 
43030     NaT 
43031     NaT 
43032 2015-11-19 08:00:00 
43033 2015-11-19 08:00:00 
43034 2015-11-24 08:00:00 
1

It sounds like you want something like df.apply(lambda row: pd.to_datetime(row['response_date']['$date']), axis=1):

In [41]: df 
Out[41]: 
           response_date 
0 {'$date': '2011-06-15T09:00:00.000+0100'} 

In [42]: df['response_date'] = df.apply(lambda row: pd.to_datetime(row['response_date']['$date']), axis=1) 

In [43]: df 
Out[43]: 
     response_date 
0 2011-06-15 08:00:00 
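
Note that the output shows 08:00 for an input of 09:00 with a +0100 offset, which suggests the value was converted to UTC. How mixed offsets are handled may vary between pandas versions, so the sketch below makes the conversion explicit with `utc=True`. The `Europe/London` zone is only an assumption for illustration; the source does not say which local zone the data came from.

```python
import pandas as pd

raw = pd.Series([
    "2011-06-15T09:00:00.000+0100",
    "2012-01-10T11:00:00.000+0000",
])

# Parse everything into one timezone-aware UTC column.
utc = pd.to_datetime(raw, utc=True)

# Optional: view the same instants in a local zone (assumed here).
local = utc.dt.tz_convert("Europe/London")

# Optional: drop the timezone to get naive UTC timestamps,
# matching the naive values shown in the output above.
naive_utc = utc.dt.tz_localize(None)

print(utc)
print(local)
print(naive_utc)
```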
+0

Great, thanks! Please see the edited question. –

+0

It depends on what you mean by "ignore". To drop all rows with NaN, use 'df.dropna()'; more generally, http://pandas.pydata.org/pandas-docs/stable/missing_data.html gives an overview of the various things you can do. Or is what you want 'df.apply(lambda row: pd.to_datetime(row['response_date']['$date']) if not pd.isnull(row['response_date']) else np.nan, axis=1)'? – fuglede

+0

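Thanks. I can't really drop the missing values from the original DataFrame. In the worst case, I can mask the missing values, do what you suggest, and then insert the converted values back in the right places while keeping the original missing values. –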
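
The masking idea in this comment could look roughly like the sketch below. It assumes every non-missing entry is a dict with a `$date` key, uses sample values from the edited question, and adds `utc=True` as an assumption. No rows are dropped from the original frame: converted values are aligned back on the index and the missing rows come out as NaT.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"response_date": [
        {"$date": "2015-11-18T15:20:00.000+0000"},
        np.nan,
        {"$date": "2015-11-19T08:00:00.000+0000"},
    ]},
    index=[43029, 43030, 43032],
)

# True where a value is present, False where it is missing.
mask = df["response_date"].notna()

# Pull the date string out of the dicts on the non-missing rows only.
dates = df.loc[mask, "response_date"].map(lambda d: d["$date"])
parsed = pd.to_datetime(dates, utc=True)

# Align back to the full index; rows that were missing become NaT.
df["response_date"] = parsed.reindex(df.index)
print(df)
```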

1

Try this:

In [70]: pd.to_datetime(
      df.response_date.map(lambda x: 
            x['$date'] if isinstance(x, dict) and '$date' in x 
              else x), 
      errors='coerce') 
Out[70]: 
0 2012-01-10 11:00:00 
1 2012-01-19 13:00:00 
2 2011-06-15 08:00:00 
3 2011-06-21 23:00:00 
4 2011-06-30 08:00:00 
5     NaT 
6     NaT 
7 2011-07-20 08:00:00 
8 2011-07-25 23:00:00 
9 2011-08-01 23:00:00 
Name: response_date, dtype: datetime64[ns]
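
The expression above can be wrapped in a small helper for reuse. This is only a sketch: the function name `parse_nested_dates` and its `key` parameter are hypothetical, and `utc=True` is an added assumption. With `errors='coerce'`, strings that cannot be parsed become NaT instead of raising.

```python
import numpy as np
import pandas as pd


def parse_nested_dates(series, key="$date"):
    """Hypothetical helper: unwrap {key: string} dicts and parse to datetimes."""
    # Dicts that contain the key are unwrapped; everything else passes through.
    unwrapped = series.map(
        lambda x: x[key] if isinstance(x, dict) and key in x else x
    )
    # Unparseable strings and missing values both end up as NaT.
    return pd.to_datetime(unwrapped, errors="coerce", utc=True)


s = pd.Series([
    {"$date": "2011-06-15T09:00:00.000+0100"},
    np.nan,
    {"$date": "not a date"},
])
result = parse_nested_dates(s)
print(result)
```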