2016-11-29

Can anyone explain why the sparsity of a DocumentTermMatrix depends on the weighting (tf vs. tf-idf) for the same corpus?

My understanding:

tf >= 0 (absolute frequency value) 

tfidf >= 0 (for negative idf, tf=0) 

sparse entry = 0 

nonsparse entry > 0 

So the exact ratio of sparse to non-sparse entries should be identical for the two DTMs created by the code below.

library(tm) 
data(crude) 

dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf)) 
dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf)) 
dtm 
dtm2 

However:

> dtm 
<<DocumentTermMatrix (documents: 20, terms: 1266)>> 
Non-/sparse entries: 2255/23065 
Sparsity   : 91% 
Maximal term length: 17 
Weighting   : term frequency (tf) 
> dtm2 
<<DocumentTermMatrix (documents: 20, terms: 1266)>> 
Non-/sparse entries: 2215/23105 
Sparsity   : 91% 
Maximal term length: 17 
Weighting   : term frequency - inverse document frequency (normalized) (tf-idf) 

Answer


The sparsity can differ: a tf-idf value is zero if either the TF is zero or the IDF is zero, and the IDF is zero whenever the term occurs in every document. See the following example:

txts <- c("super World", "Hello World", "Hello super top world") 
library(tm) 
tf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTf)) 
tfidf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTfIdf)) 

inspect(tf) 
# <<TermDocumentMatrix (terms: 4, documents: 3)>> 
# Non-/sparse entries: 8/4 
# Sparsity   : 33% 
# Maximal term length: 5 
# Weighting   : term frequency (tf) 
# 
#  Docs 
# Terms 1 2 3 
# hello 0 1 1 
# super 1 0 1 
# top 0 0 1 
# world 1 1 1 

inspect(tfidf) 
# <<TermDocumentMatrix (terms: 4, documents: 3)>> 
# Non-/sparse entries: 5/7 
# Sparsity   : 58% 
# Maximal term length: 5 
# Weighting   : term frequency - inverse document frequency (normalized) (tf-idf) 
# 
#  Docs 
# Terms   1   2   3 
# hello 0.0000000 0.2924813 0.1462406 
# super 0.2924813 0.0000000 0.1462406 
# top 0.0000000 0.0000000 0.3962406 
# world 0.0000000 0.0000000 0.0000000 

The term super occurs 1 time in document 1, which has 2 terms, and it occurs in 2 out of 3 documents:

1/2 * log2(3/2) 
# [1] 0.2924813 

The term world occurs 1 time in document 3, which has 4 terms, and it occurs in all 3 documents:

1/4 * log2(3/3) # 1/4 * 0 
# [1] 0
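
The same effect can be checked on the crude corpus from the question: every entry that is non-sparse under tf but sparse under tf-idf must belong to a term that occurs in all documents (IDF = 0). A small sketch (exact terms and counts may vary with your tm version):

```r
library(tm)
data(crude)

dtm  <- DocumentTermMatrix(crude, control = list(weighting = weightTf))
dtm2 <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf))

m  <- as.matrix(dtm)
m2 <- as.matrix(dtm2)

# Document frequency of each term: in how many documents it appears.
df <- colSums(m > 0)

# Terms present in every document have IDF = log2(nDocs/df) = log2(1) = 0,
# so all of their tf entries become zeros (sparse) under tf-idf.
everywhere <- names(df)[df == nDocs(dtm)]
everywhere

# Each such term contributes nDocs formerly nonzero entries, which accounts
# exactly for the drop in non-sparse entries between the two matrices.
sum(m > 0) - sum(m2 > 0)   # non-sparse entries lost under tf-idf
sum(df[everywhere])        # the same count, derived from document frequency
```

With the output shown in the question, the difference is 2255 - 2215 = 40 entries, i.e. 40/20 = 2 terms occurring in all 20 documents.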