How can I read only specific rows from a text file?

I am trying to process data stored in a text file, test.dat, which looks like this:

-1411.85 2.6888 -2.09945 -0.495947 0.835799 0.215353 0.695579 
-1411.72 2.82683 -0.135555 0.928033 -0.196493 -0.183131 -0.865999 
-1412.53 0.379297 -1.00048 -0.654541 -0.0906588 0.401206 0.44239 
-1409.59 -0.0794765 -2.68794 -0.84847 0.931357 -0.31156 0.552622 
-1401.63 -0.0235102 -1.05206 0.065747 -0.106863 -0.177157 -0.549252 
.... 
....

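The file is several GB, so I would very much like to read it in small blocks of rows. I would like to use numpy's loadtxt function, since it quickly converts everything to a numpy array. However, I have not managed to do this so far, because the function seems to offer only a column selection like this: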

data = np.loadtxt("test.dat", delimiter=' ', skiprows=1, usecols=range(1,7))

Any ideas on how to achieve this? If it is not possible with loadtxt, are there other options available in Python?

Source

2015-08-15 P i

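The fname argument of loadtxt can be a generator, so to read small blocks of lines, use a file-reading generator like the one shown at http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python, but adapted to read a small number of lines rather than bytes. – 2015-08-15 17:10:37

See also: http://stackoverflow.com/a/27962976/901925 - "Fastest way to read every nth row with numpy's genfromtxt" – hpaulj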
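As a rough illustration of the "every nth row" idea mentioned in the comment above, the sketch below passes a stepped itertools.islice to np.genfromtxt. It is not necessarily the method used in the linked answer; the in-memory sample and the variable names are assumptions made for the example.

```python
import io
import itertools

import numpy as np

# Hypothetical in-memory stand-in for a large file such as test.dat.
sample = io.StringIO(
    "-1411.85 2.6888 -2.09945\n"
    "-1411.72 2.82683 -0.135555\n"
    "-1412.53 0.379297 -1.00048\n"
    "-1409.59 -0.0794765 -2.68794\n"
    "-1401.63 -0.0235102 -1.05206\n"
)

n = 2  # keep every 2nd line, starting with the first one

# islice(sample, 0, None, n) lazily yields lines 0, n, 2n, ...
every_nth = np.genfromtxt(itertools.islice(sample, 0, None, n), dtype=float)

print(every_nth.shape)
```

With a real file, the StringIO object would be replaced by a handle from open(); only the selected lines are parsed, although every line is still read from disk.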

hpaulj pointed me in the right direction with his comment.

The following code works perfectly for me:

import numpy as np
import itertools
with open('test.dat') as f_in:
    # islice(f_in, 1, 12) skips the first line and passes the next 11 lines to genfromtxt
    x = np.genfromtxt(itertools.islice(f_in, 1, 12, None), dtype=float)
    print(x[0, :])

Thanks a lot!
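To walk through the whole file block by block rather than reading a single slice, the same islice idea can be wrapped in a generator. This is only a sketch: read_in_chunks is a hypothetical helper name, the sample data is an in-memory stand-in for test.dat, and np.loadtxt is used with its default whitespace delimiter.

```python
import io
import itertools

import numpy as np


def read_in_chunks(lines, chunk_size):
    """Yield 2-D arrays of at most chunk_size rows from an iterable of text lines."""
    lines = iter(lines)
    while True:
        # Pull the next chunk_size lines; the list is empty once the input is exhausted.
        block = list(itertools.islice(lines, chunk_size))
        if not block:
            break
        # ndmin=2 keeps a one-row final block two-dimensional.
        yield np.loadtxt(block, ndmin=2)


# Hypothetical in-memory stand-in for a large file such as test.dat.
sample = io.StringIO(
    "-1411.85 2.6888 -2.09945\n"
    "-1411.72 2.82683 -0.135555\n"
    "-1412.53 0.379297 -1.00048\n"
    "-1409.59 -0.0794765 -2.68794\n"
    "-1401.63 -0.0235102 -1.05206\n"
)

shapes = [chunk.shape for chunk in read_in_chunks(sample, 2)]
print(shapes)
```

Because the generator consumes the file handle lazily, only chunk_size lines are held in memory at a time.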

Source

2015-08-15 18:13:36

If you can use pandas, this is easier:

In [2]: import pandas as pd 

In [3]: df = pd.read_table('test.dat', delimiter=' ', skiprows=1, usecols=range(1,7), nrows=3, header=None) 

In [4]: df.values 
Out[4]: 
array([[ 2.82683 , -0.135555 , 0.928033 , -0.196493 , -0.183131 , 
     -0.865999 ], 
     [ 0.379297 , -1.00048 , -0.654541 , -0.0906588, 0.401206 , 
     0.44239 ], 
     [-0.0794765, -2.68794 , -0.84847 , 0.931357 , -0.31156 , 
     0.552622 ]])

Edit

If you want to read, say, every k rows at a time, you can specify chunksize. For example,

reader = pd.read_table('test.dat', delimiter=' ', usecols=range(1,7), header=None, chunksize=2) 
for chunk in reader: 
    print(chunk.values)

Output:

[[ 2.6888 -2.09945 -0.495947 0.835799 0.215353 0.695579] 
[ 2.82683 -0.135555 0.928033 -0.196493 -0.183131 -0.865999]] 
[[ 0.379297 -1.00048 -0.654541 -0.0906588 0.401206 0.44239 ] 
[-0.0794765 -2.68794 -0.84847 0.931357 -0.31156 0.552622 ]] 
[[-0.0235102 -1.05206 0.065747 -0.106863 -0.177157 -0.549252 ]]

How you store the chunks inside the for loop is up to you. Note that in this case reader is a TextFileReader, not a DataFrame, so you can iterate through it lazily.

For more details, see this.
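One possible way to handle the chunks inside the loop is sketched below: each chunk is either kept as its own ndarray or folded into a running result. The in-memory sample, the use of pd.read_csv with a whitespace separator, and the names arrays and col_sums are assumptions made for this example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory stand-in for a large file such as test.dat.
sample = io.StringIO(
    "-1411.85 2.6888 -2.09945\n"
    "-1411.72 2.82683 -0.135555\n"
    "-1412.53 0.379297 -1.00048\n"
    "-1409.59 -0.0794765 -2.68794\n"
    "-1401.63 -0.0235102 -1.05206\n"
)

# With chunksize set, read_csv returns a lazy reader instead of a DataFrame.
reader = pd.read_csv(sample, sep=r"\s+", header=None, chunksize=2)

arrays = []             # one ndarray per chunk
col_sums = np.zeros(3)  # running per-column sum across all chunks

for chunk in reader:
    block = chunk.to_numpy()
    arrays.append(block)            # keep the chunk for later use, or ...
    col_sums += block.sum(axis=0)   # ... reduce it so nothing has to be kept

print(len(arrays), col_sums)
```

For a file of several GB, keeping every chunk in a list defeats the purpose, so the reduction pattern is usually the one to prefer.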

Source

2015-08-15 17:36:18 yangjie

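I don't understand how I would read the first three rows, then the next three, and so on. Could you explain? Thanks for your effort! –

Do you mean writing the first three rows into one ndarray, then the next three into another ndarray, and so on? – yangjie

Yes, that is exactly what I need! –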

You might want to use the itertools recipes.

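from itertools import zip_longest  # named izip_longest in Python 2
import numpy as np


def grouper(n, iterable, fillvalue=None):
    # Collect items into groups of n; the last group is padded with fillvalue.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def lazy_reader(fp, nlines, sep, skiprows, usecols):
    # Yield one array per group of nlines lines.
    # Note: skiprows is applied to every chunk, not only to the first one.
    with open(fp) as inp:
        for chunk in grouper(nlines, inp, ""):
            yield np.loadtxt(chunk, delimiter=sep, skiprows=skiprows, usecols=usecols)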

The function returns a generator of arrays.

lazy_data = lazy_reader(...) 
next(lazy_data) # this will give you the next chunk 
# or you can iterate 
for chunk in lazy_data: 
    ...

Source

2015-08-15 17:44:55
