Sublinear TF transformation causes a ValueError in sklearn

I'm doing some document classification work using sklearn's HashingVectorizer followed by a tf-idf transformation. With the default TfidfTransformer parameters everything works, but if I set sublinear_tf=True, the following error is raised:

ValueError        Traceback (most recent call last) 
<ipython-input-16-137f187e99d8> in <module>() 
----> 5 tfidf.transform(test) 

D:\Users\DB\Anaconda\lib\site-packages\sklearn\feature_extraction\text.pyc in  transform(self, X, copy) 
    1020 
    1021   if self.norm: 
-> 1022    X = normalize(X, norm=self.norm, copy=False) 
    1023 
    1024   return X 

D:\Users\DB\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in normalize(X, norm, axis, copy) 
    533   raise ValueError("'%d' is not a supported axis" % axis) 
    534 
--> 535  X = check_arrays(X, sparse_format=sparse_format, copy=copy)[0] 
    536  warn_if_not_float(X, 'The normalize function') 
    537  if axis == 0: 

D:\Users\DB\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in check_arrays(*arrays, **options) 
    272     if not allow_nans: 
    273      if hasattr(array, 'data'): 
--> 274       _assert_all_finite(array.data) 
    275      else: 
    276       _assert_all_finite(array.values()) 

D:\Users\DB\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in _assert_all_finite(X) 
    41    and not np.isfinite(X).all()): 
    42   raise ValueError("Input contains NaN, infinity" 
---> 43       " or a value too large for %r." % X.dtype) 
    44 
    45 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). 

I found a minimal sample of text that triggers the error and ran some diagnostics:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

hv_stops = HashingVectorizer(ngram_range=(1,2), preprocessor=neg_preprocess, stop_words='english')
tfidf = TfidfTransformer(sublinear_tf=True).fit(hv_stops.transform(X)) 
test = hv_stops.transform(X[4:6]) 
print np.any(np.isnan(test.todense())) #False 
print np.any(np.isinf(test.todense())) #False 
print np.all(np.isfinite(test.todense())) #True 
tfidf.transform(test) #Raises the ValueError 

Any ideas what is causing this error? Let me know if you need more information. Thanks in advance!

Edit:

This single text item triggers the error for me:

hv_stops = HashingVectorizer(ngram_range=(1,3), stop_words='english', non_negative=True) 
item = u'b number b number b number conclusion no product_neg was_neg returned_neg for_neg evaluation_neg review of the medd history records did not find_neg any_neg deviations_neg or_neg anomalies_neg it is not suspected_neg that_neg the_neg product_neg failed_neg to_neg meet_neg specifications_neg the investigation could not verify_neg or_neg identify_neg any_neg evidence_neg of_neg a_neg medd_neg deficiency_neg causing_neg or_neg contributing_neg to_neg the_neg reported_neg problem_neg based on the investigation the need for corrective action is not indicated_neg should additional information be received that changes this conclusion an amended medd report will be filed zimmer considers the investigation closed this mdr is being submitted late as this issue was identified during a retrospective review of complaint files ' 
li = [item] 
fail = hv_stops.transform(li) 
TfidfTransformer(sublinear_tf=True).fit_transform(fail) 

I think you need to pass 'non_negative=True' to the 'HashingVectorizer'. Tf-idf is not defined for negative values. – 2014-09-03 18:07:13

@larsmans That doesn't seem to work; I still get the same error. I'll find a minimal example of the text that raises the error and add it to the post. – Shakesbeery 2014-09-04 15:58:20

Could this be a sparse matrix issue? If I call 'fail.todense()' before the tf-idf transformation, it works fine. – Shakesbeery 2014-09-04 17:00:37

Answer

I found the cause. TfidfTransformer assumes the sparse matrix it receives is canonical, i.e. that its data member contains no actual zeros. However, HashingVectorizer produces a sparse matrix with explicitly stored zeros. These cause the log transform to produce -inf, which in turn makes the normalization fail because the matrix has an infinite norm.
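
Here is a minimal sketch of that failure mode (my own construction, not code from the question): a CSR matrix whose data array holds an explicitly stored zero, as can happen when two hashed features cancel out in the same bucket, produces -inf under the sublinear log transform.

import numpy as np
import scipy.sparse as sp

# Hand-built 1x3 CSR matrix whose .data array contains an explicit zero,
# mimicking what HashingVectorizer can produce when hashed features cancel.
X = sp.csr_matrix((np.array([2.0, 0.0, 3.0]),   # stored values (note the zero)
                   np.array([0, 1, 2]),         # column indices
                   np.array([0, 3])),           # row pointer
                  shape=(1, 3))

# With sublinear_tf=True the transformer takes the log of the stored values;
# log(0) gives -inf, so the row norm is infinite and normalize() blows up.
print(np.log(X.data))   # [ 0.69314718        -inf  1.09861229]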

This is a bug in scikit-learn; I've filed a report for it, but I'm not yet sure what the fix will be.
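
Until that is resolved, one possible workaround (a sketch of my own, not an official fix) is to strip the explicitly stored zeros from the CSR matrix with scipy's eliminate_zeros() before fitting the transformer:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

hv_stops = HashingVectorizer(ngram_range=(1, 3), stop_words='english',
                             non_negative=True)
fail = hv_stops.transform([item])   # `item` is the failing document from the question
fail.eliminate_zeros()              # drop the explicitly stored zeros in place
result = TfidfTransformer(sublinear_tf=True).fit_transform(fail)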