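Sum of probabilities always gives 1 (100%) in SGDClassifier in Python
I am predicting some values from my training dataset and computing the class probabilities, but adding them up always gives me 1, or 100%. This is my training data: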
Address Location_ID
Arham Brindavan,plot no.9,3rd road Near ls Stn,cannop 4485
Revanta,Behind nirmal puoto Mall, G-M link Road, Mulund(W) 10027
Sandhu Arambh,Opp St.Mary's Convent, rose rd, Mulund(W) 10027
Naman Premirer, Military Road, Marol Andheri E 5041
Dattatreya Ayuedust Adobe Hanspal, bhubaneshwar 6479
This is my test data:
Address Location_ID
Tata Vivati , Mhada Colony, Mulund (E), Mumbai 10027
Evershine Madhuvan,Sen Nagar, Near blue Energy,Santacruz(E) 4943
This is what I have tried:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
data=pd.read_csv('D:/All files/abc.csv')
msk = np.random.rand(len(data)) < 0.8
data_train = data[msk]
data_train_add = data_train.iloc[:, 0] # training set: addresses (.ix was removed from pandas, so .iloc is used)
data_train_loc = data_train.iloc[:, 1] # training set: Location_ID labels
data_test1 = data[~msk]
data_test = data_test1.iloc[:, 0] # testing set: addresses
data_train_add = np.array(data_train_add)
data_train_loc = np.array(data_train_loc)
count_vect = CountVectorizer(ngram_range=(1,3))
X_train_counts = count_vect.fit_transform(data_train_add.ravel())
tfidf_transformer = TfidfTransformer()
data_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
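# loss='log_loss' and max_iter are the names used by recent scikit-learn releases for the older loss='log' and n_iter
clf_svm = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-3, max_iter=5, random_state=42).fit(data_train_tfidf, data_train_loc.ravel())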
X_new_counts = count_vect.transform(data_test.ravel())
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted_svm = clf_svm.predict(X_new_tfidf)
clf_svm_prob=clf_svm.predict_proba(X_new_tfidf)
prob_sum=clf_svm_prob.sum(axis=1)
print(prob_sum)
Output:
array([ 1., 1., 1., 1.])
Why does it give a probability of 1, or 100%, and which parameter should I change to get the sum of the probabilities? Thanks in advance.
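It is summing the probabilities of all classes for each sample, so it will obviously be 1. What did you expect? Can you explain a bit more about what you want to achieve? Do you want to sum the probabilities of a single class over all test samples? –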
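@VivekKumar Yes, I expect it to give me the sum of the probabilities of each word in the test record. For example, if for the test record "Tata Vivati , Mhada Colony, Mulund (E), Mumbai" the word probabilities are 0.00023, 0.07693, 0.28811, 0.198827, 0.123121 and 0.05920, then it should add up only those probabilities (summing the values above gives about 0.746, or roughly 75%). – deepesh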
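clf_svm is a classification estimator. It does not output per-word probabilities, only class probabilities. I don't understand what you mean by word probabilities. –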
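Since the estimator only reports class probabilities, one possible reading of the question is that a single confidence value per test record is wanted. A minimal sketch of that reading: the `proba` matrix and `classes` array below are hypothetical stand-ins for `clf_svm.predict_proba(X_new_tfidf)` and `clf_svm.classes_`, with made-up numbers, and the value taken per record is the largest class probability in its row.

```python
import numpy as np

# Hypothetical stand-ins for clf_svm.classes_ and clf_svm.predict_proba(...);
# the numbers are invented for illustration.
classes = np.array([4485, 4943, 5041, 6479, 10027])
proba = np.array([
    [0.05, 0.10, 0.07, 0.04, 0.74],  # test record 1
    [0.06, 0.70, 0.12, 0.05, 0.07],  # test record 2
])

best = proba.argmax(axis=1)      # column index of the most probable class per record
predicted = classes[best]        # map column indices back to Location_ID labels
confidence = proba.max(axis=1)   # probability assigned to that class

for loc, p in zip(predicted, confidence):
    print(f"Location_ID {loc}: {p:.0%}")
```

The columns of `predict_proba` follow the order of `classes_`, which is why the index from `argmax` can be used to look up the label.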