
Keras fit_generator with a pandas iterator object

I have a CSV that is too large to read into memory at once, so I want to chunk it and feed it to a Keras model piece by piece. Although chunksize & steps_per_epoch correctly account for how many rows are in my CSV, I still hit a StopIteration error, so I am clearly misunderstanding how fit_generator works.

Code:

import pandas as pd 
import numpy as np 
from keras.models import Sequential 
from keras.layers import Dense, Dropout 

np.random.seed(26) 
x_train_generator = pd.read_csv('X_train.csv', header=None, chunksize=150000) 
y_train_generator = pd.read_csv('Y_train.csv', header=None, chunksize=150000) 
x_test_generator = pd.read_csv('X_test.csv', header=None, chunksize=50000) 
y_test_generator = pd.read_csv('Y_test.csv', header=None, chunksize=50000) 

model = Sequential() 
model.add(Dense(500, input_dim=1132, activation='tanh')) 
model.add(Dense(1, activation='sigmoid')) 

model.compile(loss='binary_crossentropy', metrics=['accuracy'], 
       optimizer='adam') 

model.fit_generator((x_train_generator.get_chunk().as_matrix(),
                     y_train_generator.get_chunk().as_matrix()),
                    steps_per_epoch=37,
                    epochs=1,
                    verbose=2,
                    validation_data=(x_test_generator.get_chunk().as_matrix(),
                                     y_test_generator.get_chunk().as_matrix()),
                    validation_steps=37
                    )

Error output:

Exception in thread Thread-107:                                            
Traceback (most recent call last):                                           
    File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner                                  
    self.run()                                                
    File "/usr/lib/python2.7/threading.py", line 754, in run                                     
    self.__target(*self.__args, **self.__kwargs) 
    File "/home/user/myenv/local/lib/python2.7/site-packages/keras/utils/data_utils.py", line 568, in data_generator_task 
    generator_output = next(self._generator) 
TypeError: tuple object is not an iterator 

--------------------------------------------------------------------------- 
StopIteration        Traceback (most recent call last) 
/home/user/tmp_keras.py in <module>() 
    22   verbose=2, 
    23   validation_data=(x_test_generator.get_chunk().as_matrix(), y_test_generator.get_chunk().as_matrix()), 
---> 24   validation_steps=37 
    25    ) 
    26 

/home/user/myenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.pyc in wrapper(*args, **kwargs) 
    85     warnings.warn('Update your `' + object_name + 
    86        '` call to the Keras 2 API: ' + signature, stacklevel=2) 
---> 87    return func(*args, **kwargs) 
    88   wrapper._original_function = func 
    89   return wrapper 

/home/user/myenv/local/lib/python2.7/site-packages/keras/models.pyc in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, initial_epoch) 
    1119           workers=workers, 
    1120           use_multiprocessing=use_multiprocessing, 
-> 1121           initial_epoch=initial_epoch) 
    1122 
    1123  @interfaces.legacy_generator_methods_support 

/home/user/myenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.pyc in wrapper(*args, **kwargs) 
    85     warnings.warn('Update your `' + object_name + 
    86        '` call to the Keras 2 API: ' + signature, stacklevel=2) 
---> 87    return func(*args, **kwargs) 
    88   wrapper._original_function = func 
    89   return wrapper 

/home/user/myenv/local/lib/python2.7/site-packages/keras/engine/training.pyc in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch) 
    2009     batch_index = 0 
    2010     while steps_done < steps_per_epoch: 
-> 2011      generator_output = next(output_generator) 
    2012 
    2013      if not hasattr(generator_output, '__len__'): 

StopIteration: 

Oddly, if I wrap the fit_generator() call in a while 1: try: ... except StopIteration: block, it manages to run.

I have tried passing x/y_train_generator in the fit_generator arguments without the get_chunk().as_matrix() calls, but that failed because I was not passing Keras a numpy array.

Do you know what chunksize=150000 does? Also, do you know whether you actually need it? If you don't know whether you need it, you probably don't. –

It fetches the next 150000 rows of the dataframe, right? The CSV is over 5 million rows and > 20 GB, so the only ways I know to read it are chunksize or specifying iterator=True. – user3555455

It returns an iterator object; you still need to iterate over it. –

Answer


As mentioned in the comments, your problem is that pandas' .get_chunk() returns an iterator, and that is what .as_matrix() is being called on (rather than what you were hoping would happen: the iterator returned by .get_chunk() being turned into a DataFrame first, and .as_matrix() then being called on that).
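For reference, and not part of the original answer: what fit_generator actually expects as its first argument is a Python generator that keeps yielding (inputs, targets) tuples of numpy arrays, which the tuple built in the question's call is not. A minimal sketch of such a generator over the chunked CSV readers, with an illustrative function name and the question's file names and chunk size, might look like this:

import pandas as pd

def csv_batch_generator(x_path, y_path, chunksize):
    """Yield (inputs, targets) numpy tuples indefinitely for fit_generator."""
    while True:  # fit_generator keeps pulling batches across epochs
        x_reader = pd.read_csv(x_path, header=None, chunksize=chunksize)
        y_reader = pd.read_csv(y_path, header=None, chunksize=chunksize)
        for x_chunk in x_reader:
            # assumes the two CSVs are row-aligned, as in the question
            yield x_chunk.values, y_reader.get_chunk().values

# e.g. with the model compiled in the question:
# model.fit_generator(csv_batch_generator('X_train.csv', 'Y_train.csv', 150000),
#                     steps_per_epoch=37, epochs=1, verbose=2)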

To restructure your code, you need a loop, and you need to update the model inside that loop. I have two suggestions for you:

  1. (Easiest) Restructure your program above: put a loop over the DataFrame that comes out of each pandas chunk, so that you can call .as_matrix() on it. That way you actually get concrete DataFrames for your X_train, y_train and X_test, y_test data rather than IO iterators, and you can then update your model with each new chunk of data. (If you already have a trained model, calling .fit() again updates the existing model.) A rough sketch of this loop is given below, after the second suggestion.

  2. (Use Keras functionality instead of pandas functionality) Use the built-in Keras utility for reading large data sets: specifically, the Keras HDF5Matrix utility (link to Keras documentation) reads data from an HDF5 file in chunks and transparently treats that data as a Numpy array. Something like this:

    from keras.utils.io_utils import HDF5Matrix

    def load_data(path_to_data, start_ix, n_samples):
        """
        This works for loading testing or training data.
        This assumes the input data have been named "inputs" and the
        output data "outputs" in the HDF5 file, and that you are
        grabbing n_samples from the file starting at start_ix.
        """
        X = HDF5Matrix(path_to_data, 'inputs', start_ix, start_ix + n_samples)
        y = HDF5Matrix(path_to_data, 'outputs', start_ix, start_ix + n_samples)
        return (X, y)

    X_train, y_train = load_data(path_to_training_h5, train_start_ix, n_training_samples)
    X_test, y_test = load_data(path_to_testing_h5, testing_start_ix, n_testing_samples)


As with solution #1, this would overall be structured in a for loop, updating start_ix and n_samples on each iteration, in addition to updating (re-fitting) the model on each iteration. For another example of how to use HDF5Matrix, see this example from GitHub user @jfsantos.
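To make suggestion #1 concrete, here is a rough sketch of that chunk-by-chunk training loop. This is not the answerer's code: it reuses the file names, chunk size, and model from the question and assumes the two CSVs are row-aligned.

import pandas as pd
from keras.models import Sequential
from keras.layers import Dense

# Model built and compiled exactly as in the question.
model = Sequential()
model.add(Dense(500, input_dim=1132, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')

x_chunks = pd.read_csv('X_train.csv', header=None, chunksize=150000)
y_chunks = pd.read_csv('Y_train.csv', header=None, chunksize=150000)

for x_chunk in x_chunks:                # each x_chunk is a concrete DataFrame
    y_chunk = y_chunks.get_chunk()      # the matching 150000-row slice of labels
    # .values hands Keras real numpy arrays; repeated .fit() calls keep
    # updating the same weights instead of restarting training.
    model.fit(x_chunk.values, y_chunk.values, epochs=1, verbose=2)

Suggestion #2 has the same overall shape: a loop that advances start_ix by n_samples on each pass, calls load_data() for that window, and then fits the model on the returned X/y slices.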

Thanks! Both of your suggestions helped me a lot. – user3555455