Using pandas read_csv() twice on an open file

While experimenting with pandas, I noticed some odd behavior in pandas.read_csv and was wondering whether someone with more experience could explain what might be causing it.

To start, here is the basic class definition I use to create a new pandas.DataFrame from a .csv file:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)

This works fine, and calling the class from my __main__.py successfully creates a pandas DataFrame:

from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')

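But I noticed that this process automatically uses the first row of the .csv as the pandas.DataFrame.columns index. I decided to number the columns instead. Since I did not want to assume that I already knew the number of columns, I took the approach of opening the file, loading it into a DataFrame, counting the columns, and then reloading the DataFrame with the right number of column names using range().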

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))

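Keeping my processing in __main__.py the same, I got back a DataFrame with the correct number of columns (500 in this case) and the proper names (0...499), but it was otherwise empty (no row data).

Scratching my head, I decided to close self.csvfile and reopen it, like so: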

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Close the .csv file.           #<---- +++++++
        self.csvfile.close()             #<---- Added
        # Re-open file.                  #<---- Block
        self.csvfile = open(filepath)    #<---- +++++++

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))

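Closing the file and reopening it gives the correct result: a pandas.DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.

My question is: why does closing the file and reopening it make a difference?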

Source

2014-09-19 Grant Hulegaard

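When you open a file with

open(filepath)

a file handle iterator is returned. An iterator is good for one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

reads the contents and exhausts the iterator. The handle's position is left at the end of the file, so subsequent calls to pd.read_csv find nothing left to read.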
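The one-pass behavior described above can be reproduced without pandas. The following is a minimal sketch, assuming a small temporary file; the file contents and variable names are illustrative and do not come from the question:

```python
import os
import tempfile

# Write a tiny CSV to a temporary file (illustrative data, not the asker's file).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b,c\n1,2,3\n4,5,6\n")
    path = tmp.name

with open(path) as handle:
    first_pass = list(handle)    # consumes every line; position is now at EOF
    second_pass = list(handle)   # nothing left to read
    handle.seek(0)               # rewind to the start of the file
    third_pass = list(handle)    # all lines are available again

os.remove(path)

print(len(first_pass), len(second_pass), len(third_pass))
```

The second pass comes back empty because the handle is still positioned at the end of the file, which is the state the second pd.read_csv call runs into.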

Note that you can avoid this problem by passing the file path directly to pd.read_csv:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath,
                                        names=range(self.numcolumns))

pd.read_csv will then open (and close) the file for you.

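PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.

Even better, instead of calling pd.read_csv twice (which is inefficient), you can rename the columns like this:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file once, then count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        self.numcolumns = len(self.csvdataframe.columns)
        # Replace the header-derived column names with 0..n-1.
        self.csvdataframe.columns = range(self.numcolumns)

Source

2014-09-19 22:27:43 unutbu

Thanks for the information on file iterators. That makes sense. I will make the change to pass 'filepath' instead of the open file. However, renaming the columns the way you suggest at the end replaces the column names, which means I lose the first row of data. – 2014-09-21 22:25:25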

Then add header=None, so that the first row is treated as part of the data rather than interpreted as column names. – unutbu 2014-09-21 23:08:08
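A minimal sketch of that suggestion, assuming a small in-memory CSV with no header row; the sample data and the io.StringIO stand-in for a real file path are illustrative only:

```python
import io

import pandas as pd

# Illustrative data with no header row; io.StringIO stands in for a file path.
csv_text = "1,2,3\n4,5,6\n"

# header=None keeps the first line as data and numbers the columns 0..n-1.
df = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(df.columns))
print(df.shape)
```

With header=None, a single read already yields integer column labels, so neither a second pd.read_csv call nor a manual rename should be needed.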

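Ah yes, I forgot about header=None... I am having trouble getting this to work, but that is a separate issue. Thank you for answering my original question! I was just curious about the lower-level, behind-the-scenes interaction that causes the "open file" behavior. Thanks! – 2014-09-22 03:31:56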
