Assign pandas DataFrame column dtypes

I want to set the dtypes of multiple columns in a pd.DataFrame (I have a file that I had to parse manually into a list of lists, because the file was not amenable to pd.read_csv).

import pandas as pd
print(pd.DataFrame([['a','1'],['b','2']],
                   dtype={'x':'object','y':'int'},
                   columns=['x','y']))

I get

ValueError: entry not a 2- or 3- tuple

The only way I can set them is by looping through each column and recasting it with astype.

dtypes = {'x':'object','y':'int'}
mydata = pd.DataFrame([['a','1'],['b','2']],
                      columns=['x','y'])
for c in mydata.columns:
    mydata[c] = mydata[c].astype(dtypes[c])
print(mydata['y'].dtype)  #=> int64

Is there a better way?

Source

2014-01-17 hatmatrix

This might be a good [bug/feature request](https://github.com/pydata/pandas/issues/new); currently I'm not sure what the dtype arg is doing (you can –

FYI: `df = pd.DataFrame([['a','1'],['b','2']], dtype='int', columns=['x','y'])` "works"... but :s –

Yes, "works" indeed; unpredictable... – hatmatrix

You can use convert_objects to infer better dtypes:

In [11]: df 
Out[11]: 
    x y 
0 a 1 
1 b 2 

In [12]: df.dtypes 
Out[12]: 
x object 
y object 
dtype: object 

In [13]: df.convert_objects(convert_numeric=True) 
Out[13]: 
    x y 
0 a 1 
1 b 2 

In [14]: df.convert_objects(convert_numeric=True).dtypes 
Out[14]: 
x object 
y  int64 
dtype: object

Magic!

Source

2014-01-17 23:26:04

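A bit like `type.convert` in R; nice, but in some cases one would still want an explicit specification. – hatmatrix

@crippledlambda Agreed, I think this would be a good feature request, and not too hard to implement. –

Be careful if you have a column that needs to stay a string but contains at least one value that can be converted to an int. One such value is all it takes for the entire column to be converted to float64 –

For those arriving from Google (etc.) like me:

convert_objects has been deprecated. If you use it, you get a warning like this: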

FutureWarning: convert_objects is deprecated. Use the data-type specific converters 
pd.to_datetime, pd.to_timedelta and pd.to_numeric.

You should do something like the following instead:

df = df.astype(float)
df["A"] = pd.to_numeric(df["A"])
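As a minimal sketch of the pd.to_numeric route applied to the data from the question: the list `numeric_columns` and the series `mixed` are names made up for this illustration. The second half uses `errors='coerce'` to show how a single non-numeric value turns a column into float64, which is similar in spirit to the pitfall mentioned in the comments above.

```python
import pandas as pd

df = pd.DataFrame([['a', '1'], ['b', '2']], columns=['x', 'y'])

# Hypothetical list of the columns we expect to hold numbers.
numeric_columns = ['y']
for name in numeric_columns:
    # Raises ValueError if a value cannot be parsed as a number.
    df[name] = pd.to_numeric(df[name])

print(df.dtypes)

# With errors='coerce', unparseable values become NaN instead of raising,
# and the presence of NaN forces the result to float64.
mixed = pd.Series(['a', '1'])
coerced = pd.to_numeric(mixed, errors='coerce')
print(coerced.dtype)
```

Converting only the columns you name keeps string columns such as `x` untouched.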

Source

2016-03-23 17:02:54

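Another way to set the column types is to first construct a numpy record array with the desired types, fill it in, and then pass it to the DataFrame constructor.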

import pandas as pd 
import numpy as np  

x = np.empty((10,), dtype=[('x', np.uint8), ('y', np.float64)]) 
df = pd.DataFrame(x) 

df.dtypes -> 

x  uint8 
y float64
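Note that np.empty leaves the values uninitialized, so every row has to be filled before the frame is used. A small sketch of the fill step, using np.zeros and example values chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Pre-size a structured array; np.zeros gives defined starting values.
records = np.zeros(2, dtype=[('x', np.uint8), ('y', np.float64)])

# Fill each field (column) by name.
records['x'] = [1, 2]
records['y'] = [0.5, 1.5]

# The DataFrame picks up the field names and dtypes from the array.
df = pd.DataFrame(records)
print(df.dtypes)
```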

Source

2016-07-02 04:49:52

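I faced a similar problem. In my case I have thousands of files from Cisco logs that I need to parse manually.

To stay flexible with fields and types, I have successfully tested StringIO + read_csv, which does accept a dict for the dtype specification.

I usually read each file (5k-20k lines) into a buffer and create the dtype dictionary dynamically.

Eventually I concatenate these dataframes into one large dataframe and dump it into hdf5.

Something along these lines: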

import pandas as pd 
import io 

output = io.StringIO() 
output.write('A,1,20,31\n') 
output.write('B,2,21,32\n') 
output.write('C,3,22,33\n') 
output.write('D,4,23,34\n') 

output.seek(0) 


df=pd.read_csv(output, header=None, 
     names=["A","B","C","D"], 
     dtype={"A":"category","B":"float32","C":"int32","D":"float64"}, 
     sep="," 
     ) 

df.info() 

<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 5 entries, 0 to 4 
Data columns (total 4 columns): 
A 5 non-null category 
B 5 non-null float32 
C 5 non-null int32 
D 5 non-null float64 
dtypes: category(1), float32(1), float64(1), int32(1) 
memory usage: 205.0 bytes 
None
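The per-file workflow described above (buffer, dynamically built dtype dict, concatenation) might be sketched as follows. The helper `parse_buffer`, the `chunks` list standing in for parsed files, and the chosen types are assumptions made for this illustration; the final HDF5 dump is left out.

```python
import io
import pandas as pd


def parse_buffer(text, names, dtypes):
    """Read CSV text from an in-memory buffer with explicit column dtypes."""
    return pd.read_csv(io.StringIO(text), header=None, names=names, dtype=dtypes)


names = ['A', 'B', 'C', 'D']
# Build the dtype dict dynamically, e.g. from a list worked out per file.
type_list = ['object', 'float32', 'int32', 'float64']
dtypes = dict(zip(names, type_list))

# Hypothetical stand-ins for the manually parsed contents of two files.
chunks = ['A,1,20,31\nB,2,21,32\n', 'C,3,22,33\nD,4,23,34\n']

frames = [parse_buffer(chunk, names, dtypes) for chunk in chunks]
combined = pd.concat(frames, ignore_index=True)
print(combined.dtypes)
```

Column A is read as plain object here rather than category, because concatenating categorical columns whose categories differ does not keep the category dtype.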

Not very Pythonic... but it does the job.

Hope it helps.

Source

2016-11-07 20:10:02

You can set the types explicitly with pandas DataFrame.astype(dtype, copy=True, raise_on_error=True, **kwargs), passing a dictionary of the dtypes you want as dtype.

Here's an example:

import pandas as pd 
wheel_number = 5 
car_name = 'jeep' 
minutes_spent = 4.5 

# set the columns 
data_columns = ['wheel_number', 'car_name', 'minutes_spent'] 

# create an empty dataframe 
data_df = pd.DataFrame(columns = data_columns) 
df_temp = pd.DataFrame([[wheel_number, car_name, minutes_spent]],columns = data_columns) 
data_df = data_df.append(df_temp, ignore_index=True) 

In [11]: data_df.dtypes 
Out[11]: 
wheel_number  float64 
car_name   object 
minutes_spent float64 
dtype: object 

data_df = data_df.astype(dtype= {"wheel_number":"int64", 
     "car_name":"object","minutes_spent":"float64"})

Now you can see that it has changed:

In [18]: data_df.dtypes 
Out[18]: 
wheel_number  int64 
car_name   object 
minutes_spent float64
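Applied to the list of lists from the original question, the same dict-based astype call could look like this. It is a sketch; the variable names `rows` and `dtypes` are chosen for illustration.

```python
import pandas as pd

rows = [['a', '1'], ['b', '2']]
dtypes = {'x': 'object', 'y': 'int64'}

# Build the frame first, then cast all columns in one astype call.
df = pd.DataFrame(rows, columns=list(dtypes)).astype(dtypes)
print(df.dtypes)
```

This replaces the per-column loop from the question with a single call.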

Source

2017-04-08 01:26:14 Lauren

You're better off using typed np.arrays, and then passing the data and column names as a dictionary.

import numpy as np
import pandas as pd

# Feature: np arrays are 1: efficient, 2: can be pre-sized
x = np.array(['a', 'b'], dtype=object)
y = np.array([1, 2], dtype=np.int32)
df = pd.DataFrame({
    'x': x,  # Feature: column name is near data array
    'y': y,
})
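If the data starts out as a list of rows, as in the question, one way to get typed arrays is to transpose the rows into columns first. This is a sketch under that assumption; `rows` and `dtypes` are illustrative names.

```python
import numpy as np
import pandas as pd

rows = [['a', '1'], ['b', '2']]
dtypes = {'x': object, 'y': np.int32}

# zip(*rows) transposes the row-wise data into one tuple per column.
columns = zip(*rows)

# Build one typed array per column; astype parses the numeric strings.
df = pd.DataFrame({
    name: np.array(values).astype(dtype)
    for (name, dtype), values in zip(dtypes.items(), columns)
})
print(df.dtypes)
```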

Source

2018-03-08 22:25:59
