2016-07-25 123 views

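I want to read a CSV file with 1000 rows, so I decided to read it in chunks, but I am running into a problem. How can I use pandas to read 10 records at a time from a CSV file?

On the 1st iteration I want to read the first 10 records and convert specific columns to a Python dictionary. On the 2nd iteration I want to skip the first 10 records and read the next 10, and so on.

Input.csv: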

time,line_id,high,low,avg,total,split_counts 
1468332421098000,206,50879,50879,50879,2,"[50000,2]" 
1468332421195000,206,39556,39556,39556,2,"[30000,2]" 
1468332421383000,206,61636,61636,61636,2,"[60000,2]" 
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]" 
1468332423489000,206,38514,38445,38475,6,"[30000,6]" 
1468332421672000,206,60079,60079,60079,2,"[60000,2]" 
1468332421818000,206,44664,44664,44664,2,"[40000,2]" 
1468332422164000,206,48500,48500,48500,2,"[40000,2]" 
1468332423490000,206,39469,37894,38206,12,"[30000,12]" 
1468332422538000,206,44023,44023,44023,2,"[40000,2]" 
1468332423491000,206,38813,38813,38813,2,"[30000,2]" 
1468332423528000,206,75970,75970,75970,2,"[70000,2]" 
1468332423533000,206,42546,42470,42508,4,"[40000,4]" 
1468332423536000,206,41065,40888,40976,4,"[40000,4]" 
1468332423566000,206,66401,62453,64549,6,"[60000,6]" 

Code:

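import pandas

if __name__ == '__main__':
    s = 0
    n = 10
    while True:
        # From the second pass on, skiprows=s also skips the header line,
        # so a data row ends up being used as the column names.
        df = pandas.read_csv('Input.csv', skiprows=s, nrows=n)
        d = dict(zip(df.time, df.split_counts))
        print(d)
        s += n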

This is the error I get:

AttributeError: 'DataFrame' object has no attribute 'time' 

I know that on the second iteration it cannot find the time and split_counts attributes, but is there any way to do what I want?


You can also use the chunksize parameter of read_csv. That makes this O(n) rather than O(n^2), because the file is read only once.

Answers

1

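The first iteration should work fine, but every later iteration is a problem.

read_csv has a header kwarg whose default value is 'infer' (which here basically means 0). This means the first line that is actually parsed is used as the column names of the DataFrame. Once skiprows skips the real header line, a data row is used as the column names instead, so df.time no longer exists.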
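To make the failure concrete, here is a minimal sketch using three rows copied from the question's Input.csv. The variable names (`data`, `first`, `second`) are only illustrative, and an in-memory string stands in for the file.

```python
import io
import pandas as pd

# Header line plus three rows copied from Input.csv in the question.
data = u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
'''

# skiprows=0: the real header line is the first parsed line.
first = pd.read_csv(io.StringIO(data), skiprows=0, nrows=1)
print(list(first.columns))

# skiprows=1: the header line is skipped, so the first data row
# is used as the column names and 'time' is gone.
second = pd.read_csv(io.StringIO(data), skiprows=1, nrows=1)
print(list(second.columns))
print('time' in second.columns)
```

The same thing happens in the question's loop on the second pass, where skiprows=10 skips the header line together with the first nine data rows.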

read_csv also has another kwarg, names.

As the documentation explains:

header : int or list of ints, default ‘infer’ Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None List of column names to use. If file contains no header row, then you should explicitly pass header=None

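You should pass header=None and names=['time', 'line_id', 'high', 'low', 'avg', 'total', 'split_counts'] to read_csv. Note that with header=None the file's own header line is parsed as a data row unless skiprows skips it as well.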
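A minimal sketch of how the loop could look with these two arguments. `read_in_blocks` is a hypothetical helper, not something from pandas, and the temporary file stands in for Input.csv. It starts at skiprows=1 so the header line is never parsed as data, and it stops when nothing is left. Depending on the pandas version, reading past the end of the file seems to either raise `EmptyDataError` or return an empty DataFrame, so the sketch checks for both.

```python
import os
import tempfile

import pandas as pd
from pandas.errors import EmptyDataError

NAMES = ['time', 'line_id', 'high', 'low', 'avg', 'total', 'split_counts']


def read_in_blocks(path, n=10):
    """Yield one {time: split_counts} dict per block of n rows (hypothetical helper)."""
    s = 1  # start at 1 so the file's own header line is skipped
    while True:
        try:
            df = pd.read_csv(path, header=None, names=NAMES,
                             skiprows=s, nrows=n)
        except EmptyDataError:
            break  # some pandas versions raise once nothing is left
        if df.empty:
            break  # others return an empty frame instead
        yield dict(zip(df.time, df.split_counts))
        s += n


# Stand-in for Input.csv: the header plus the first five rows from the question.
sample = u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
'''

with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write(sample)
    path = f.name

blocks = list(read_in_blocks(path, n=2))  # n=2 only to keep the demo small
os.remove(path)

for block in blocks:
    print(block)
```

This approach reopens the file and skips from the top on every pass, so it does more work the further into the file it gets.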


Yes, I tried this solution, but there is a problem with 'while(True)': I get an error on the last iteration: 'EmptyDataError: No columns to parse from file' – jezrael


@jezrael - Yes, we need to handle the case where it returns an empty DataFrame. – kit


@DeepSpace - Thanks. – kit

1

You can use the chunksize parameter of read_csv instead:

import pandas as pd 
import io 

temp=u'''time,line_id,high,low,avg,total,split_counts 
1468332421098000,206,50879,50879,50879,2,"[50000,2]" 
1468332421195000,206,39556,39556,39556,2,"[30000,2]" 
1468332421383000,206,61636,61636,61636,2,"[60000,2]" 
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]" 
1468332423489000,206,38514,38445,38475,6,"[30000,6]" 
1468332421672000,206,60079,60079,60079,2,"[60000,2]" 
1468332421818000,206,44664,44664,44664,2,"[40000,2]" 
1468332422164000,206,48500,48500,48500,2,"[40000,2]" 
1468332423490000,206,39469,37894,38206,12,"[30000,12]" 
1468332422538000,206,44023,44023,44023,2,"[40000,2]" 
1468332423491000,206,38813,38813,38813,2,"[30000,2]" 
1468332423528000,206,75970,75970,75970,2,"[70000,2]" 
1468332423533000,206,42546,42470,42508,4,"[40000,4]" 
1468332423536000,206,41065,40888,40976,4,"[40000,4]" 
1468332423566000,206,66401,62453,64549,6,"[60000,6]"''' 
# after testing, replace io.StringIO(temp) with the filename

# chunksize=2 here, just for testing
reader = pd.read_csv(io.StringIO(temp), chunksize=2)
print (reader)
# <pandas.io.parsers.TextFileReader object at 0x000000000AD1CD68>
for df in reader: 
    print(dict(zip(df.time, df.split_counts))) 

{1468332421098000: '[50000,2]', 1468332421195000: '[30000,2]'} 
{1468332421383000: '[60000,2]', 1468332423568000: '[30000,2][40000,2]'} 
{1468332423489000: '[30000,6]', 1468332421672000: '[60000,2]'} 
{1468332421818000: '[40000,2]', 1468332422164000: '[40000,2]'} 
{1468332423490000: '[30000,12]', 1468332422538000: '[40000,2]'} 
{1468332423491000: '[30000,2]', 1468332423528000: '[70000,2]'} 
{1468332423533000: '[40000,4]', 1468332423536000: '[40000,4]'} 
{1468332423566000: '[60000,6]'} 

pandas documentation


@jezrael - Thanks for your reply. This is also a good way to do what I want. – kit