R - Get the token count for each document in a DocumentTermMatrix

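The reason I want to do this is so that I can convert absolute frequencies into relative frequencies. Getting the count of each token per document is easy, but I'm not sure how to get the total token count per document and use it at the same time, so that I can compute count / total tokens for every document in one step. Is there a way to bind the rowSums as a column and then use that column in the calculation, if that is even the right way to do it?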

Thanks

Source

2017-12-03 CodeCake

Using the blogs data from the English version of the Heliohost corpus as my text data, it is easy to get token counts per document with the quanteda package.

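library(readr)
library(quanteda)
# Path to the English blogs file
blogFile <- "./capstone/data/en_US.blogs.txt"
inFile <- blogFile
# Read the file as a character vector, one element per line
blogData <- read_lines(blogFile)

# Build a corpus in which each line is one document (timed)
system.time(theText <- corpus(blogData))

# summary() reports Types, Tokens and Sentences for each document
head(summary(theText))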

...and the output is:

> head(summary(theText)) 
Corpus consisting of 899288 documents, showing 100 documents: 

  Text Types Tokens Sentences
 text1    18     20         1
 text2     6      7         1
 text3   104    154         7
 text4    36     43         1
 text5    91    119         5
 text6    13     13         1

Source: C:/Users/leona/gitrepos/datascience/* on x86-64 by leona 
Created: Sat Dec 02 20:59:23 2017 
Notes:  
>
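If the goal is relative frequencies rather than just the totals, here is a minimal sketch along the same lines. It assumes a reasonably recent quanteda, where dfm() is built from tokens() and dfm_weight() accepts scheme = "prop". The two toy documents are made up for illustration and are not from the blog data.

```r
library(quanteda)

# Hypothetical toy texts standing in for the blog data
txt <- c(doc1 = "apple banana banana cherry", doc2 = "apple apple")
toks <- tokens(corpus(txt))

# Total token count per document
totals <- ntoken(toks)

# Document-feature matrix of absolute counts
counts <- dfm(toks)

# Relative frequencies: each count divided by its document's total
rel <- dfm_weight(counts, scheme = "prop")
```

Each row of rel should then sum to 1, which is a quick way to check that the weighting did what was intended.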

Source

2017-12-03 02:07:28

Thanks. Actually, I think I found a way: divide by rowSums(dtm). I hope that is the right approach.
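For anyone following along, the division by row sums can be sketched in base R as below. The small matrix is a made-up stand-in for the document-term matrix; with tm I am assuming you would first convert with as.matrix(dtm), since base rowSums() expects an ordinary matrix.

```r
# Toy document-term matrix: rows are documents, columns are terms.
# Hypothetical stand-in for as.matrix(dtm).
m <- matrix(
  c(2, 1, 0,
    0, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs = c("doc1", "doc2"),
                  Terms = c("apple", "banana", "cherry"))
)

# Total tokens per document
doc_totals <- rowSums(m)

# R recycles doc_totals down the columns, so row i is divided by doc_totals[i]
rel_freq <- m / doc_totals
```

This works without binding the totals as an extra column. Note that a document with zero tokens would give NaN in its row, and for a large sparse matrix slam::row_sums(dtm) avoids the dense conversion when computing the totals.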

Source

2017-12-03 23:12:05 CodeCake
