Error with FeatureUnion in an sklearn Pipeline

I have the following DataFrame:

ID Text 
1 qwerty 
2 asdfgh

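I want to create an MD5 hash of the Text field and drop the ID column from the DataFrame above. To do this, I created a simple pipeline with custom transformers from sklearn.

Here is the code I am using: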

import hashlib

import sklearn.base
from sklearn import pipeline

class cust_txt_col(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, key):
        self.key = key  # name of the text column to hash

    def fit(self, x, y=None):
        return self

    def hash_generate(self, txt):
        # Collapse runs of whitespace, then return the MD5 hex digest.
        m = hashlib.md5()
        text = str(txt)
        long_text = ' '.join(text.split())
        m.update(long_text.encode('utf-8'))
        text_hash = m.hexdigest()
        return text_hash

    def transform(self, x):
        return x[self.key].apply(lambda z: self.hash_generate(z)).values

class cust_regression_vals(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x = x.drop(['Gene', 'Variation', 'ID', 'Text'], axis=1)
        return x.values

fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion([
        ('hash', cust_txt_col('Text')),  # can pass in either a pipeline
        ('normalized', cust_regression_vals())  # or a transformer
    ]))
])

When I run this, I get the following error:

ValueError: all the input arrays must have same number of dimensions

Can you please tell me what is wrong with my code?

If I run the classes one by one:

For cust_txt_col I get this output:

['3e909f222a1e06098ec7ca1ea7e84540' '1691bdba3b75df145169e0501369fce3' 
'1691bdba3b75df145169e0501369fce3' ..., 'e11ec9863aaeb93f77a231319021e14d' 
'851c517b2af0a46cb9bc9373b748b6ff' '0ffe46fc75d21a5347b1f1a5a84526ad']

For cust_regression_vals I get this output:

[[qwerty], 
    [asdfgh]]


2017-07-15 Backtrack

Shouldn't it be 'cust_txt_col(dataframe['Text'])'? Also, what output do you get if you run the classes one by one?

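@E.Z. Updated my post with the class o/p. – Backtrack

The problem may be the shape of 'cust_regression_vals'; try adding 'return x.ravel().values' at the end of the second class and check whether that fixes it. If not, can you post the output of 'cust_txt_col.shape'?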

cust_txt_col is returning a 1d array. FeatureUnion requires each component transformer to return a 2d array.
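
To illustrate the point, here is a minimal NumPy-only sketch of the mismatch. It assumes that FeatureUnion joins dense transformer outputs column-wise with something equivalent to np.hstack, which is how the scikit-learn versions I am aware of behave. The arrays are hypothetical stand-ins for the two transformer outputs, not the real data.

```python
import numpy as np

# Hypothetical stand-ins for the two transformer outputs.
hashes = np.array(["hash_a", "hash_b"], dtype=object)  # 1-D, shape (2,)
numeric = np.array([[0.5], [1.5]])                      # 2-D, shape (2, 1)

try:
    # Mixing a 1-D and a 2-D array cannot be stacked column-wise.
    np.hstack([hashes, numeric])
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Turning the 1-D array into a single column makes the shapes compatible.
stacked = np.hstack([hashes.reshape(-1, 1), numeric])
print(stacked.shape)  # (2, 2)
```

The exact wording of the ValueError depends on the NumPy version, but the cause is the same: one input has one dimension and the other has two.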


2017-07-17 01:35:47 joeln

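Please also suggest to the OP @Backtrack how to solve this, for example by using reshape() to change the shape of the transform() output. Then this answer would be complete.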
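
Following the reshape() suggestion, one possible fix could look like the sketch below. It is not the original poster's final code: the class names HashTextColumn and DropColumns, the Score column, and the sample values are all hypothetical, chosen only so the example runs on its own. The only substantive change from the question is that the hashing transformer returns a column of shape (n_samples, 1).

```python
import hashlib

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class HashTextColumn(TransformerMixin, BaseEstimator):
    """Hypothetical variant of cust_txt_col that returns a 2-D column."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def _md5(self, text):
        # Collapse whitespace, then hash, as in the question.
        normalized = " ".join(str(text).split())
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def transform(self, X):
        hashes = X[self.key].apply(self._md5).values
        # reshape(-1, 1) turns shape (n,) into (n, 1) for FeatureUnion.
        return hashes.reshape(-1, 1)


class DropColumns(TransformerMixin, BaseEstimator):
    """Hypothetical variant of cust_regression_vals with configurable columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=self.columns).values  # already 2-D


# Assumed sample data; "Score" is an invented numeric column.
df = pd.DataFrame({"ID": [1, 2], "Text": ["qwerty", "asdfgh"], "Score": [0.5, 1.5]})

union = FeatureUnion([
    ("hash", HashTextColumn("Text")),
    ("rest", DropColumns(["ID", "Text"])),
])
features = union.fit_transform(df)
print(features.shape)  # (2, 2)
```

Because the hash column holds strings, the combined array has object dtype; if a purely numeric matrix is needed downstream, the hashes would have to be encoded separately.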
