當我運行tfidf爲一組文件時,它返回了一個tfidf矩陣,看起來像這樣。什麼是理想的tfidf矩陣
(1, 12) 0.656240233446
(1, 11) 0.754552023393
(2, 6) 1.0
(3, 13) 1.0
(4, 2) 1.0
(7, 9) 1.0
(9, 4) 0.742540927053
(9, 5) 0.66980069547
(11, 19) 0.735138466738
(11, 7) 0.677916982176
(12, 18) 1.0
(13, 14) 0.697455191865
(13, 11) 0.716628394177
(14, 5) 1.0
(15, 8) 1.0
(16, 17) 1.0
(18, 1) 1.0
(19, 17) 1.0
(22, 13) 1.0
(23, 3) 1.0
(25, 6) 1.0
(26, 19) 0.476648253537
(26, 7) 0.879094103268
(28, 10) 0.532672175403
(28, 7) 0.523456282204
我想知道這是什麼,我無法理解這是怎麼提供。 當我處於調試模式時,我開始瞭解索引,indptr和數據......這些東西是與給定數據共同關聯的地方。這些是什麼? 數字有很多混亂,如果我說括號中的第一個元素是基於我的預測的文檔,我不會看到第0,第5,第6個文檔。 請幫我弄清楚它是如何在這裏工作的。然而,我知道wiki的tfidf的一般工作,記錄反向文檔和其他內容。我只是想知道這3種不同類型的數字是什麼,引用它是什麼?
的源代碼是:施加
#This contains the list of file names
_filenames =[]
#This conatains the list if contents/text in the file
_contents = []
#This is a dict of filename:content
_file_contents = {}
class KmeansClustering():
def kmeansClusters(self):
global _report
self.num_clusters = 5
km = KMeans(n_clusters=self.num_clusters)
vocab_frame = TokenizingAndPanda().createPandaVocabFrame()
self.tfidf_matrix, self.terms, self.dist = TfidfProcessing().getTfidFPropertyData()
km.fit(self.tfidf_matrix)
self.clusters = km.labels_.tolist()
joblib.dump(km, 'doc_cluster2.pkl')
km = joblib.load('doc_cluster2.pkl')
class TokenizingAndPanda():
def tokenize_only(self,text):
'''
This function tokenizes the text
:param text: Give the text that you want to tokenize
:return: it gives the filter tokes
'''
# first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
filtered_tokens = []
# filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
return filtered_tokens
def tokenize_and_stem(self,text):
# first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
filtered_tokens = []
# filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
stems = [_stemmer.stem(t) for t in filtered_tokens]
return stems
def getFilnames(self):
'''
:return:
'''
global _path
global _filenames
path = _path
_filenames = FileAccess().read_all_file_names(path)
def getContentsForFilenames(self):
global _contents
global _file_contents
for filename in _filenames:
content = FileAccess().read_the_contents_from_files(_path, filename)
_contents.append(content)
_file_contents[filename] = content
def createPandaVocabFrame(self):
global _totalvocab_stemmed
global _totalvocab_tokenized
#Enable this if you want to load the filenames and contents from a file structure.
# self.getFilnames()
# self.getContentsForFilenames()
# for name, i in _file_contents.items():
# print(name)
# print(i)
for i in _contents:
allwords_stemmed = self.tokenize_and_stem(i)
_totalvocab_stemmed.extend(allwords_stemmed)
allwords_tokenized = self.tokenize_only(i)
_totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': _totalvocab_tokenized}, index=_totalvocab_stemmed)
print(vocab_frame)
return vocab_frame
class TfidfProcessing():
def getTfidFPropertyData(self):
tfidf_vectorizer = TfidfVectorizer(max_df=0.4, max_features=200000,
min_df=0.02, stop_words='english',
use_idf=True, tokenizer=TokenizingAndPanda().tokenize_and_stem, ngram_range=(1, 1))
# print(_contents)
tfidf_matrix = tfidf_vectorizer.fit_transform(_contents)
terms = tfidf_vectorizer.get_feature_names()
dist = 1 - cosine_similarity(tfidf_matrix)
return tfidf_matrix, terms, dist
你是在談論[scikit-learn tf-idf](http://scikit-learn.org/stable /modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)? 你可以把你的代碼的一部分,並從你想從中提取信息的文件中提取? – LoicM
雅它的scikit學習tf-idf是什麼在說話,以及我的代碼的一部分張貼在上面,希望你能幫助我 –