TfidfVectorizer splits words into individual characters?

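I am trying to find the nearest neighbours of a description within a set of descriptions. The descriptions usually contain 1-15 words, and I tokenize them with scikit-learn's TfidfVectorizer. Then, using the same vectorizer, I vectorize the base description. However, the vectorizer seems to split that description into individual characters rather than words, because the resulting sparse matrix has the shape [number of letters in the base description x number of unique words in the corpus].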

description = 'total assets'

products = LoadData('C:/dict.csv', dtype = {'Code': np.str, 'LocalLanguageLabel': np.str}) 
products = products.fillna({'LocalLanguageLabel':''}) 

from sklearn.feature_extraction.text import TfidfVectorizer 
vectorizer = TfidfVectorizer(token_pattern=r'\b\w+\b') 
#tried the below two as well 
#vectorizer = TfidfVectorizer() 
#vectorizer = TfidfVectorizer(token_pattern=r'\b\w+\b', analyzer = 'word') 
dict_matrix = vectorizer.fit_transform(products['LocalLanguageLabel']) 
input_matrix = vectorizer.transform(description) 

from sklearn.neighbors import NearestNeighbors 
model = NearestNeighbors(metric='euclidean', algorithm='brute') 
model.fit(dict_matrix) 

distance, indices = model.kneighbors(input_matrix,n_neighbors = 10)

When I print input_matrix, this is what I get (as you can guess, the indices correspond to the characters of "totalassets"):

print(input_matrix) 
(0, 33478) 1.0 #t 
(1, 24021) 1.0 #o 
(2, 33478) 1.0 #t 
(3, 2298) 1.0 #a 
(4, 20272) 1.0 #l 
(6, 2298) 1.0 #a 
(7, 30874) 1.0 #s 
(8, 30874) 1.0 #s 
(9, 11386) 1.0 #e 
(10, 33478) 1.0 #t 
(11, 30874) 1.0 #s 

<12x39859 sparse matrix of type '<class 'numpy.float64'>' 
with 11 stored elements in Compressed Sparse Row format>

What is going wrong? I expected 10 distances and 10 indices, but instead I get 12 lists of 10 elements each.

2016-07-22 śmiglidigli

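Right, the answer turned out to be very simple given the time I spent on it. transform expects an iterable of documents, and a bare string is iterated character by character, so each letter was treated as its own document. I wrapped description in a list and got the expected 10 results: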

input_matrix = vectorizer.transform([description])
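
To make the difference visible, here is a minimal sketch. The `labels` list is a hypothetical stand-in for `products['LocalLanguageLabel']`, and `n_neighbors=3` is chosen only because the toy corpus is small. The character-wise case is reproduced explicitly with `list(description)`, since that is what iterating over a string yields; as far as I know, recent scikit-learn releases reject a bare string with a ValueError instead of silently splitting it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for products['LocalLanguageLabel'].
labels = [
    "total assets",
    "total liabilities",
    "current assets",
    "net income",
    "total equity",
]
description = "total assets"

vectorizer = TfidfVectorizer(token_pattern=r"\b\w+\b")
dict_matrix = vectorizer.fit_transform(labels)

# Iterating over a str yields its characters, so this mimics what
# happens when the bare string is treated as a collection of documents.
per_char = vectorizer.transform(list(description))

# Wrapping the string in a list makes it a single document.
per_doc = vectorizer.transform([description])

print(per_char.shape[0])  # one row per character, including the space
print(per_doc.shape[0])   # one row for the whole description

model = NearestNeighbors(metric="euclidean", algorithm="brute")
model.fit(dict_matrix)

# One query row in, one row of distances and indices out.
distance, indices = model.kneighbors(per_doc, n_neighbors=3)
print(distance.shape, indices.shape)
```

With a single query row, kneighbors returns one row of distances and one row of indices, which matches the result I originally expected.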

2016-07-22 21:55:31
