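pd.get_dummies() is slow with a large number of levels

I'm not sure whether this is already the fastest approach, or whether I'm doing it inefficiently.

I want to one-hot encode one particular categorical column that has 27k+ levels. The column takes different values in my two datasets, so I first combine the levels from both before calling get_dummies():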

def hot_encode_column_in_both_datasets(column_name,df,df2,sparse=True): 
    col1b = set(df2[column_name].unique()) 
    col1a = set(df[column_name].unique()) 
    combined_cats = list(col1a.union(col1b)) 
    df[column_name] = df[column_name].astype('category', categories=combined_cats) 
    df2[column_name] = df2[column_name].astype('category', categories=combined_cats) 

    df = pd.get_dummies(df, columns=[column_name],sparse=sparse) 
    df2 = pd.get_dummies(df2, columns=[column_name],sparse=sparse) 
    try:
        del df[column_name]
        del df2[column_name]
    except:
        pass
    return df,df2

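However, it has been running for more than 2 hours on the combined levels, and it is still stuck on the one-hot encoding.

Am I doing something wrong here? Or is this just the nature of running it on large datasets?

Df has 6.8m rows and 27 columns, and Df2 has 19990 rows and 27 columns, before one-hot encoding the column I want.

Advice appreciated, thank you! :)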


2017-05-28 Wboy

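'except: pass' is always wrong. I think you want 'if column_name in df:' instead. As for the rest of your question, why don't you tell us which line takes a long time? –

@JohnZwinck Thank you for your input :) In this case I don't think it really matters, but please correct me if I'm wrong. – Wboy

@JohnZwinck As I mentioned, get_dummies() is the part that takes a long time – Wboy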
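
The suggestion from the comments can be sketched roughly as below. This is not the asker's code: the function name `encode_shared_levels` and the toy frames are hypothetical, and it assumes a pandas version that provides `pandas.api.types.CategoricalDtype` (the `astype('category', categories=...)` form used in the question was deprecated in later pandas releases). It also assumes the column has no missing values, so the levels can be sorted.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def encode_shared_levels(column_name, df, df2, sparse=True):
    """One-hot encode column_name in both frames over the union of levels."""
    # Union of the levels seen in either frame, sorted for a stable column order.
    levels = sorted(set(df[column_name].unique()) | set(df2[column_name].unique()))
    dtype = CategoricalDtype(categories=levels)
    out = []
    for frame in (df, df2):
        frame = frame.copy()  # avoid mutating the caller's frame
        frame[column_name] = frame[column_name].astype(dtype)
        frame = pd.get_dummies(frame, columns=[column_name], sparse=sparse)
        # get_dummies normally drops the encoded column already; check
        # explicitly instead of swallowing every exception.
        if column_name in frame:
            del frame[column_name]
        out.append(frame)
    return out[0], out[1]


# Toy data (made up for illustration).
a = pd.DataFrame({'c': ['x', 'y'], 'n': [1, 2]})
b = pd.DataFrame({'c': ['y', 'z'], 'n': [3, 4]})
a_enc, b_enc = encode_shared_levels('c', a, b)
```

Both results carry the same dummy columns because the categories are shared, even though 'x' never occurs in the second frame and 'z' never occurs in the first.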

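I briefly reviewed the get_dummies source code, and I think it may not be taking full advantage of the sparsity in your use case. The following approach may be faster, but I did not attempt to scale it all the way up to the 19M records you have: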

import numpy as np 
import pandas as pd 
import scipy.sparse as ssp 

np.random.seed(1) 
N = 10000 

dfa = pd.DataFrame.from_dict({ 
    'col1': np.random.randint(0, 27000, N) 
    , 'col2b': np.random.choice([1, 2, 3], N) 
    , 'target': np.random.choice([1, 2, 3], N) 
    }) 

# construct an array of the unique values of the column to be encoded 
vals = np.array(dfa.col1.unique()) 
# extract an array of values to be encoded from the dataframe 
col1 = dfa.col1.values 
# construct a sparse matrix of the appropriate size and an appropriate, 
# memory-efficient dtype 
spmtx = ssp.dok_matrix((N, len(vals)), dtype=np.uint8) 
# do the encoding. NB: This is only vectorized in one of the two dimensions. 
# Finding a way to vectorize the second dimension may yield a large speed up 
for idx, val in enumerate(vals): 
    spmtx[np.argwhere(col1 == val), idx] = 1 

# Construct a SparseDataFrame from the sparse matrix and apply the index 
# from the original dataframe and column names. 
dfnew = pd.SparseDataFrame(spmtx, index=dfa.index, 
          columns=['col1_' + str(el) for el in vals]) 
dfnew.fillna(0, inplace=True)
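
A note for newer pandas versions: `pd.SparseDataFrame` was removed in pandas 1.0. The final step above might be written as in the sketch below, assuming pandas 0.25 or later, where `pd.DataFrame.sparse.from_spmatrix` is available. The four-row frame is made up, and the dense comparison is only a stand-in for building `spmtx` at toy size.

```python
import numpy as np
import pandas as pd
import scipy.sparse as ssp

# Toy stand-in for dfa (hypothetical data).
dfa = pd.DataFrame({'col1': [3, 1, 3, 2]})
vals = dfa.col1.unique()          # levels in order of first appearance
col1 = dfa.col1.values

# Dense comparison is fine at this size only; it stands in for spmtx above.
spmtx = ssp.csr_matrix((col1[:, None] == vals[None, :]).astype(np.uint8))

# Wrap the scipy matrix in a DataFrame with sparse columns.
dfnew = pd.DataFrame.sparse.from_spmatrix(
    spmtx,
    index=dfa.index,
    columns=['col1_' + str(el) for el in vals],
)
```

Entries that are not stored in the scipy matrix are treated as zeros by `from_spmatrix`, so no `fillna(0)` step should be needed in this variant.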

UPDATE

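Borrowing insights from other answers here and here, I was able to vectorize the solution in both dimensions. In my limited testing, I noticed that constructing the SparseDataFrame seems to increase the execution time several fold. So, if you don't need to return a DataFrame-like object, you can save a lot of time. This solution also handles the case where you need to encode 2+ DataFrames into 2-d arrays with equal numbers of columns.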

import numpy as np 
import pandas as pd 
import scipy.sparse as ssp 

np.random.seed(1) 
N1 = 10000 
N2 = 100000 

dfa = pd.DataFrame.from_dict({ 
    'col1': np.random.randint(0, 27000, N1) 
    , 'col2a': np.random.choice([1, 2, 3], N1) 
    , 'target': np.random.choice([1, 2, 3], N1) 
    }) 

dfb = pd.DataFrame.from_dict({ 
    'col1': np.random.randint(0, 27000, N2) 
    , 'col2b': np.random.choice(['foo', 'bar', 'baz'], N2) 
    , 'target': np.random.choice([1, 2, 3], N2) 
    }) 

# construct an array of the unique values of the column to be encoded 
# taking the union of the values from both dataframes. 
valsa = set(dfa.col1.unique()) 
valsb = set(dfb.col1.unique()) 
vals = np.array(list(valsa.union(valsb)), dtype=np.uint16) 


def sparse_ohe(df, col, vals): 
    """One-hot encoder using a sparse ndarray.""" 
    colaray = df[col].values 
    # construct a sparse matrix of the appropriate size and an appropriate, 
    # memory-efficient dtype 
    spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8) 
    # do the encoding 
    spmtx[np.where(colaray.reshape(-1, 1) == vals.reshape(1, -1))] = 1 

    # Construct a SparseDataFrame from the sparse matrix 
    dfnew = pd.SparseDataFrame(spmtx, dtype=np.uint8, index=df.index, 
           columns=[col + '_' + str(el) for el in vals]) 
    dfnew.fillna(0, inplace=True) 
    return dfnew 

dfanew = sparse_ohe(dfa, 'col1', vals) 
dfbnew = sparse_ohe(dfb, 'col1', vals)
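
One caveat about the broadcast comparison in sparse_ohe: it materializes a dense boolean array of shape (rows, levels) before `np.where` runs, which for 6.8m rows and 27k levels is on the order of 10^11 elements. A possible alternative, sketched below under the assumption that only the scipy matrix is needed, maps each value to its position in `vals` and builds the CSR matrix directly from those coordinates. The name `sparse_ohe_codes` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.sparse as ssp


def sparse_ohe_codes(df, col, vals):
    """One-hot encode df[col] over the levels in vals as a scipy CSR matrix."""
    # Position of each row's value in vals; -1 marks values not in vals.
    codes = pd.Categorical(df[col], categories=vals).codes.astype(np.int64)
    rows = np.flatnonzero(codes >= 0)
    data = np.ones(rows.size, dtype=np.uint8)
    # One stored entry per encoded row, at (row, position of its level).
    return ssp.csr_matrix((data, (rows, codes[rows])),
                          shape=(len(df), len(vals)))


# Toy data (made up for illustration).
dfa = pd.DataFrame({'col1': [5, 7, 5]})
dfb = pd.DataFrame({'col1': [9, 7]})
# Sorted union of the levels from both frames: [5, 7, 9].
vals = np.union1d(dfa.col1.unique(), dfb.col1.unique())

mtxa = sparse_ohe_codes(dfa, 'col1', vals)
mtxb = sparse_ohe_codes(dfb, 'col1', vals)
```

Because both calls share `vals`, the two matrices have the same number of columns in the same order, and memory use grows with the number of rows rather than with rows times levels.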


2017-05-28 22:05:32 blueogive

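Hey, thanks for your answer! :) How would this handle accounting for the categories in the second dataframe? – Wboy

Hello! :) If I understand this correctly, this only returns the encoded column as a sparse dataframe, rather than merging it back into the original dataframe, right? Also, I get a ValueError: unable to coerce current fill_value nan to uint8 dtype – Wboy

Hmm, I can't reproduce the ValueError. I'm using pandas 0.20.1, which was only recently released. If you need to reassemble a complete dataframe that includes all of the original columns, you can add this statement at the end: 'dfa = pd.concat([dfanew, dfa.drop('col1', axis=1)], axis=1)'. – blueogive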
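
The recombination step from the last comment can be tried on a toy frame as below. This is a sketch on made-up data; it assumes a newer pandas (0.25 or later) and builds the encoded frame with `pd.DataFrame.sparse.from_spmatrix` rather than the `SparseDataFrame` used in the answer.

```python
import numpy as np
import pandas as pd
import scipy.sparse as ssp

# Toy frame with one column to encode and one column to keep (hypothetical).
dfa = pd.DataFrame({'col1': [5, 7, 5], 'target': [1, 2, 3]})

# Stand-in for the encoded frame returned by the encoder.
onehot = ssp.csr_matrix(np.array([[1, 0], [0, 1], [1, 0]], dtype=np.uint8))
dfanew = pd.DataFrame.sparse.from_spmatrix(
    onehot, index=dfa.index, columns=['col1_5', 'col1_7'])

# Put the encoded columns next to the remaining original columns.
dfa_full = pd.concat([dfanew, dfa.drop('col1', axis=1)], axis=1)
```

The encoded columns should stay sparse after the concat, while 'target' keeps its original dense dtype.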
