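How do I read 10 records at a time from a csv file using pandas?

I want to read a csv file with 1000 rows, so I decided to read the file in chunks, but I ran into a problem while doing so. How can I read 10 records at a time from a csv file using pandas?

In the 1st iteration I want to read the first 10 records and convert specific columns to a Python dictionary; in the 2nd iteration I want to skip the first 10 records and read the next 10.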

Input.csv:

time,line_id,high,low,avg,total,split_counts 
1468332421098000,206,50879,50879,50879,2,"[50000,2]" 
1468332421195000,206,39556,39556,39556,2,"[30000,2]" 
1468332421383000,206,61636,61636,61636,2,"[60000,2]" 
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]" 
1468332423489000,206,38514,38445,38475,6,"[30000,6]" 
1468332421672000,206,60079,60079,60079,2,"[60000,2]" 
1468332421818000,206,44664,44664,44664,2,"[40000,2]" 
1468332422164000,206,48500,48500,48500,2,"[40000,2]" 
1468332423490000,206,39469,37894,38206,12,"[30000,12]" 
1468332422538000,206,44023,44023,44023,2,"[40000,2]" 
1468332423491000,206,38813,38813,38813,2,"[30000,2]" 
1468332423528000,206,75970,75970,75970,2,"[70000,2]" 
1468332423533000,206,42546,42470,42508,4,"[40000,4]" 
1468332423536000,206,41065,40888,40976,4,"[40000,4]" 
1468332423566000,206,66401,62453,64549,6,"[60000,6]"

Code:

if __name__ == '__main__': 
    s = 0 
    while(True): 
     n = 10 
     df = pandas.read_csv('Input.csv', skiprows=s, nrows=n) 
     d = dict(zip(df.time, df.split_counts)) 
     print d 
     s += n

This is the error I get:

AttributeError: 'DataFrame' object has no attribute 'time'

I understand that in the second iteration it cannot find the time and split_counts attributes, but is there any way to do what I want?

Source

2016-07-25 kit

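You can also use the chunksize parameter of read_csv. That makes this O(n) rather than O(n^2), because the file is only read once.

The first iteration should work fine, but every further iteration is problematic.

read_csv has a header kwarg whose default value is 'infer' (which is basically 0). This means the first row that is parsed from the csv is used as the column names of the dataframe. Once skiprows skips past the real header line, a data row ends up being used as the column names, so df.time no longer exists.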
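A minimal sketch of that behaviour, using a shortened in-memory copy of the question's data (the `sample` string and the variable names `first` and `second` are made up for this illustration). It only prints the column names of two consecutive reads so the difference is visible:

```python
import io

import pandas as pd

# Shortened stand-in for Input.csv (header plus four data rows).
sample = u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
'''

# First read: nothing is skipped, so the real header line names the columns.
first = pd.read_csv(io.StringIO(sample), skiprows=0, nrows=2)

# Second read: the header line and one data row are skipped, so the next
# data row is (wrongly) inferred as the header.
second = pd.read_csv(io.StringIO(sample), skiprows=2, nrows=2)

print(list(first.columns))
print(list(second.columns))
print(hasattr(second, 'time'))  # the attribute from the question is gone
```

The second frame has no `time` column at all, which is where the AttributeError in the question comes from.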

read_csv also has another kwarg, names.

As the documentation explains:

header : int or list of ints, default ‘infer’ Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None List of column names to use. If file contains no header row, then you should explicitly pass header=None

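You should pass header=None and names=['time', 'line_id', 'high', 'low', 'avg', 'total', 'split_counts'] to read_csv. With header=None every line that is not skipped is treated as data, so skiprows also has to cover the header line itself (start at 1 rather than 0).

Source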
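Putting this together, here is one possible sketch of the loop from the question, not a definitive implementation. The helper name `read_batches`, the `COLUMNS` constant and the temporary sample file are assumptions made for the example. The end of the file is handled in two ways, because depending on the pandas version a read past the last line may either raise `EmptyDataError` or return an empty frame:

```python
import os
import tempfile

import pandas as pd
from pandas.errors import EmptyDataError

COLUMNS = ['time', 'line_id', 'high', 'low', 'avg', 'total', 'split_counts']


def read_batches(path, n=10):
    """Yield one {time: split_counts} dict per batch of n data rows."""
    s = 1  # start at 1 so the header line itself is skipped
    while True:
        try:
            df = pd.read_csv(path, header=None, names=COLUMNS,
                             skiprows=s, nrows=n)
        except EmptyDataError:
            break  # nothing left to parse
        if df.empty:
            break  # an empty frame also means the file is exhausted
        yield dict(zip(df['time'], df['split_counts']))
        s += n


# Tiny stand-in for Input.csv (first three data rows of the question).
sample = '''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
'''

with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write(sample)
    path = f.name

batches = list(read_batches(path, n=2))  # n=2 only to keep the demo small
os.remove(path)

for batch in batches:
    print(batch)
```

Note that this still re-opens the file and skips from the top on every iteration, so for large files the chunksize approach is the cheaper one.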

2016-07-25 06:44:58 DeepSpace

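Yes, I tried this solution, but there is a problem with 'while(True)': on the last iteration I get the error 'EmptyDataError: No columns to parse from file' – jezrael

@jezrael- Yes, we need to handle the case where it returns an empty dataframe. – kit

@DeepSpace- Thank you. – kit

You can use chunksize in read_csv: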

import pandas as pd 
import io 

temp=u'''time,line_id,high,low,avg,total,split_counts 
1468332421098000,206,50879,50879,50879,2,"[50000,2]" 
1468332421195000,206,39556,39556,39556,2,"[30000,2]" 
1468332421383000,206,61636,61636,61636,2,"[60000,2]" 
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]" 
1468332423489000,206,38514,38445,38475,6,"[30000,6]" 
1468332421672000,206,60079,60079,60079,2,"[60000,2]" 
1468332421818000,206,44664,44664,44664,2,"[40000,2]" 
1468332422164000,206,48500,48500,48500,2,"[40000,2]" 
1468332423490000,206,39469,37894,38206,12,"[30000,12]" 
1468332422538000,206,44023,44023,44023,2,"[40000,2]" 
1468332423491000,206,38813,38813,38813,2,"[30000,2]" 
1468332423528000,206,75970,75970,75970,2,"[70000,2]" 
1468332423533000,206,42546,42470,42508,4,"[40000,4]" 
1468332423536000,206,41065,40888,40976,4,"[40000,4]" 
1468332423566000,206,66401,62453,64549,6,"[60000,6]"''' 
# after testing, replace io.StringIO(temp) with the filename

# chunksize=2 for testing; use chunksize=10 to read 10 records at a time
reader = pd.read_csv(io.StringIO(temp), chunksize=2) 
print (reader) 
<pandas.io.parsers.TextFileReader object at 0x000000000AD1CD68>

for df in reader: 
    print(dict(zip(df.time, df.split_counts))) 

{1468332421098000: '[50000,2]', 1468332421195000: '[30000,2]'} 
{1468332421383000: '[60000,2]', 1468332423568000: '[30000,2][40000,2]'} 
{1468332423489000: '[30000,6]', 1468332421672000: '[60000,2]'} 
{1468332421818000: '[40000,2]', 1468332422164000: '[40000,2]'} 
{1468332423490000: '[30000,12]', 1468332422538000: '[40000,2]'} 
{1468332423491000: '[30000,2]', 1468332423528000: '[70000,2]'} 
{1468332423533000: '[40000,4]', 1468332423536000: '[40000,4]'} 
{1468332423566000: '[60000,6]'}

See the pandas documentation.

Source

2016-07-25 06:57:07 jezrael

@jezrael- Thanks for your reply. This is also a good way to do what I want. – kit
