2016-07-28 84 views
3

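Error while concatenating dask Series into a DataFrame

I have several dask Series (dask.dataframe.core.Series) that I want to combine into a single DataFrame and then write to a CSV file. How can I do that? My attempt is below, together with the error it raises.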

Data

1,2014-04-07T10:51:09.277Z,214536502,0 
1,2014-04-07T10:54:09.868Z,214536500,0 
1,2014-04-07T10:54:46.998Z,214536506,0 
1,2014-04-07T10:57:00.306Z,214577561,0 
2,2014-04-07T13:56:37.614Z,214662742,0 
2,2014-04-07T13:57:19.373Z,214662742,0 
2,2014-04-07T13:58:37.446Z,214825110,0 
2,2014-04-07T13:59:50.710Z,214757390,0 
2,2014-04-07T14:00:38.247Z,214757407,0 
2,2014-04-07T14:02:36.889Z,214551617,0 

Code

import dask 
import datetime as dt 
clicksdat = dd.read_csv('C:\Users\TG\Downloads\yoochoose-dataFull\yoochoose-clicks100.dat', names=['Sid','Timestamp','itemid','itemcategory'], dtype={'sid':np.int64,'timestamp':np.object,'itemid':np.object,'itemcategory':np.object}) 
clicksdat['Timestamp']=clicksdat.Timestamp.apply(pd.to_datetime) 
segment = ['EM']*24 
segment[7:10] = ['M']*3 
segment[10:13] = ['A']*3 
segment[13:18] = ['E']*5 
segment[18:23] = ['N']*5 
segment[23] = 'MN' 

maxtemp=clicksdat.groupby('Sid')['Timestamp'].max() 
mintemp=clicksdat.groupby('Sid')['Timestamp'].min() 
duration=(maxtemp.sub(mintemp).apply(lambda x: x.total_seconds())) 
day=maxtemp.apply(lambda x: x.day) 
month=maxtemp.apply(lambda x: x.month) 
noofnavigations=[clicksdat.groupby('Sid').count().Timestamp][0] 
totalitems=clicksdat.groupby('Sid')['itemid'].nunique() 
totalcats=clicksdat.groupby('Sid')['itemcategory'].nunique() 
timesegment= maxtemp.apply(lambda x: segment[x.hour]) 
segmentchange=((maxtemp.apply(lambda x: segment[x.hour])!=mintemp.apply(lambda x: segment[x.hour]))) 
purchased=(clicksdat['Sid'].unique()).apply(lambda x: x in buyersession) 

print(type(maxtemp),type(mintemp),type(duration),type(day),type(month),type(noofnavigations),type(totalitems),type(totalcats),type(timesegment),type(segmentchange),type(purchased)) 
#percentile_list = pd.DataFrame({'purchased' : purchased,'duration':duration,'day':day,'month':month,'noofnavigations':noofnavigations,'totalitems':totalitems,'totalcats':totalcats,'timesegment':timesegment,'segmentchange':segmentchange },index=noofnavigations.index) 
percentile_list = dd.concat([purchased,duration,day,month,noofnavigations,totalitems,totalcats,timesegment,segmentchange],axis=1)       
percentile_list.to_csv('C:\Users\TG\Downloads\yoochoose-dataFull\yoochoose-clicks1001-727.csv') 

Error

(<class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>) 
--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-121-ad7fc3cf8839> in <module>() 
    25 print(type(maxtemp),type(mintemp),type(duration),type(day),type(month),type(noofnavigations),type(totalitems),type(totalcats),type(timesegment),type(segmentchange),type(purchased)) 
    26 #percentile_list = pd.DataFrame({'purchased' : purchased,'duration':duration,'day':day,'month':month,'noofnavigations':noofnavigations,'totalitems':totalitems,'totalcats':totalcats,'timesegment':timesegment,'segmentchange':segmentchange },index=noofnavigations.index) 
---> 27 percentile_list = dd.concat([purchased,duration,day,month,noofnavigations,totalitems,totalcats,timesegment,segmentchange],axis=1) 
    28 
    29 percentile_list.to_csv('C:\Users\TG\Downloads\yoochoose-dataFull\yoochoose-clicks1001-727.csv') 

C:\Users\TG\Anaconda3\envs\dato-env\lib\site-packages\dask\dataframe\multi.pyc in concat(dfs, axis, join, interleave_partitions) 
    576  else: 
    577   if axis == 1: 
--> 578    raise ValueError('Unable to concatenate DataFrame with unknown ' 
    579        'division specifying axis=1') 
    580   else: 

ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1 
+0

Could you include a minimal example that other people can run? http://stackoverflow.com/help/mcve – MRocklin

+0

@MRocklin By "example" I hope you mean the data, which I have now added to the question. –

Answer

0

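First, your code does not run as posted, because it contains undefined references (dd, np). So I cannot reproduce your problem without investing unnecessary time.
However, since I ran into a similar problem, I have an idea: try setting an index on your dataframes. (In my case everything worked fine as long as there was a valid index, but calling .drop_duplicates() somehow destroyed the index or the partitioning, and I then got the same error.)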