Can anyone explain why the sparsity of a DTM depends on the weighting (tf vs. tf-idf) when the corpus is the same?
My understanding:
tf >= 0 (absolute frequency value)
tfidf >= 0 (for negative idf, tf=0)
sparse entry = 0
nonsparse entry > 0
So the exact ratio of sparse to non-sparse entries should be the same for the two DTMs created by the code below.
library(tm)
data(crude)
dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf))
dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
dtm
dtm2
However:
> dtm
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2255/23065**
Sparsity : 91%
Maximal term length: 17
Weighting : term frequency (tf)
> dtm2
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2215/23105**
Sparsity : 91%
Maximal term length: 17
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
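One possible explanation, sketched on a toy matrix rather than on `crude`: as far as I understand `weightTfIdf`, it computes idf as log2(nDocs / docFreq), so a term that occurs in every document gets idf = 0 and all of its tf-idf entries become 0, i.e. sparse. The outputs above differ by 2255 - 2215 = 40 entries, which would be consistent with two terms occurring in all 20 documents, though I have not checked that against the corpus here. The matrix, its term names, and the variable names below are made up for illustration; the code uses base R only and mimics the weighting by hand instead of calling tm.

```r
# Toy term-frequency matrix: 3 documents (rows) x 3 terms (columns).
# "oil" occurs in every document; the other two terms do not.
tf <- matrix(c(1, 2, 0,
               1, 0, 3,
               2, 1, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("doc", 1:3), c("oil", "price", "opec")))

n_docs   <- nrow(tf)
doc_freq <- colSums(tf > 0)          # number of documents containing each term
idf      <- log2(n_docs / doc_freq)  # assumed idf formula; 0 when doc_freq == n_docs

# Normalized tf (each row divided by its document length), then scaled by idf per column.
tfidf <- sweep(tf / rowSums(tf), 2, idf, `*`)

nonzero_tf    <- sum(tf > 0)     # non-sparse entries under tf weighting
nonzero_tfidf <- sum(tfidf > 0)  # non-sparse entries under tf-idf weighting

print(idf)
print(c(tf = nonzero_tf, tfidf = nonzero_tfidf))
```

In this toy case the tf matrix has 6 non-zero entries and the tf-idf matrix has 3: the three entries of "oil" drop out because its idf is 0. On the real data, comparing `colSums(as.matrix(dtm) > 0)` with the number of documents should show whether the same thing is happening.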