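Using weights from the survey package in a TermDocumentMatrix

I work a lot with samples that I want to generalize to a larger population. Most of the time, however, the samples are biased and need to be weighted with the survey package. What I have not found is a way to apply these weights to a term-document matrix. Consider this example: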

library(tm) 
library(wordcloud) 

set.seed(123) 

# Consider this example: I have performed a sample from a population and now have 
# 1000 observations of text. In the data I also have information about gender. 

# The sample 
data <- rbind(data.frame(gender = "M",
                         words = sample(c("education", "money", "family",
                                          "house", "debts"),
                                        600, replace = TRUE)),
              data.frame(gender = "F",
                         words = sample(c("career", "bank", "friends",
                                          "drinks", "relax"),
                                        400, replace = TRUE)))
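# I create a simple wordcloud
text <- paste(data$words, collapse = " ")
matrix <- as.matrix(
  TermDocumentMatrix(
    VCorpus(
      VectorSource(text)
    )
  )
)
# Rows of the matrix are terms; the single column holds their counts
wordcloud(rownames(matrix), matrix[, 1])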

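The resulting wordcloud looks like this:

[wordcloud image]

As you can see, the terms mentioned by men are larger because they occur more often. However, I know the true distribution of this population, so this wordcloud is biased.

The true gender distribution: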

true_gender_dist <- data.frame(gender = c("M", "F"), freq = nrow(data) * c(0.49,0.51))

With the survey package I can use the rake function:

library(survey) 
rake_data <- rake(design = svydesign(ids = ~1, data = data),
                  sample.margins = list(~gender),
                  population.margins = list(true_gender_dist))

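To use the weights in analyses, visualizations, and so on (that is, in anything not covered by the survey package), I add the weights to the original data.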

data_weighted <- cbind(data, data.frame(weights = weights(rake_data)))

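So far so good. However, I would like to make a wordcloud that takes these weights into account.

My first attempt was to use the weights when building the term-document matrix.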

text_corp <- VCorpus(VectorSource(text)) 
w_tdm <- TermDocumentMatrix(text_corp, 
           control = list(weighting = weights(rake_data)))

But then I get:

Error in .TermDocumentMatrix(m, weighting) : invalid weighting

Is this at all possible?

Source

2016-11-28 FilipW

In the example, you don't need `sample` for the `gender` column. `data.frame(gender = 1, ...` will do –

You could use [inverse document frequency (idf)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to weight the term frequencies. Or just divide each gender's term frequencies by the number of surveys for that gender. – emilliman5
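
The second suggestion (dividing each gender's term counts by the number of respondents of that gender) could be sketched in base R roughly as follows. This assumes one word per respondent, as in the question's toy data. `pop_share` simply restates the 49/51 split from the question, and the other variable names are illustrative only.

```r
# Toy sample mirroring the question: 600 male and 400 female respondents,
# one word each (assumption: same structure as the question's `data`).
set.seed(123)
data <- rbind(
  data.frame(gender = "M",
             words = sample(c("education", "money", "family", "house", "debts"),
                            600, replace = TRUE)),
  data.frame(gender = "F",
             words = sample(c("career", "bank", "friends", "drinks", "relax"),
                            400, replace = TRUE))
)

# Raw counts: one row per word, one column per gender
counts <- table(data$words, data$gender)

# Divide each column by the number of respondents of that gender,
# giving within-gender relative frequencies (each column sums to 1)
rel_freq <- prop.table(counts, margin = 2)

# Rescale each column by the known population share of that gender
pop_share <- c(F = 0.51, M = 0.49)
adjusted <- sweep(rel_freq, 2, pop_share[colnames(rel_freq)], "*")

# One adjusted relative frequency per word, summing to 1 overall
adjusted_freq <- rowSums(adjusted)
print(round(adjusted_freq, 3))
```

For a single variable such as gender, this gives the same relative sizes as weighting every respondent by population share divided by sample share.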

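Yes, @emilliman5, that is the sort of thing I had in mind. I just don't know how I would program it. I guess I will have to experiment with the tm package, which has functions for specifying weights. Since the weights may also have to account for things like political bias, age, etc., I am looking for a more sophisticated approach. – FilipW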
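
When the weights have to correct for several variables at once (gender and age, say), raking adjusts the weights to one margin at a time and repeats until all margins match. The following is a minimal base R sketch of that idea (iterative proportional fitting), not the survey package's implementation. The function `rake_weights`, the `age` variable, and all target counts are made up for illustration.

```r
# Hypothetical helper: rake unit weights so that the weighted totals match
# the target counts for every variable listed in `targets`.
rake_weights <- function(df, targets, max_iter = 100, tol = 1e-10) {
  w <- rep(1, nrow(df))
  for (iter in seq_len(max_iter)) {
    max_change <- 0
    for (v in names(targets)) {
      groups <- as.character(df[[v]])
      # Current weighted total per level of variable v
      current <- vapply(split(w, groups), sum, numeric(1))
      # Adjustment ratio per level, looked up for every row
      ratio <- targets[[v]][names(current)] / current
      adj <- unname(ratio[groups])
      max_change <- max(max_change, abs(adj - 1))
      w <- w * adj
    }
    # Stop once a full pass no longer changes the weights noticeably
    if (max_change < tol) break
  }
  w
}

# Made-up sample: gender is skewed as in the question, age is invented
sample_df <- data.frame(
  gender = rep(c("M", "M", "F", "F"), times = c(200, 400, 250, 150)),
  age    = rep(c("young", "old", "young", "old"), times = c(200, 400, 250, 150))
)

# Made-up population totals; each margin must sum to the same total
targets <- list(gender = c(M = 490, F = 510),
                age    = c(young = 400, old = 600))

w <- rake_weights(sample_df, targets)

# Weighted totals per margin should now be close to the targets
print(tapply(w, sample_df$gender, sum))
print(tapply(w, sample_df$age, sum))
```

In practice `rake()` from the survey package does this job; the sketch is only meant to show what the resulting per-respondent weights represent.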

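I can't comment, so I'll use an answer to comment on your question:

You might be interested in the R package stm (structural topic models). It makes it possible to infer latent topics with respect to meta variables (continuous and/or discrete).

You can produce different kinds of plots to check how the meta variables influence

a) the topics chosen,

b) the preferred words within a topic,

c) and some more :)

Some links, in case you are interested:

Paper describing the R package

R documentation

Some more Papers <- this is a nice collection if you want to dive deeper into the subject!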

Source

2016-11-30 11:40:07 Jakob

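Thanks for the tip. Interesting package. If I remember correctly, the tm package also offers a way to store meta variables, but stm modeling is interesting nonetheless. Still, it is not quite what I am looking for. In its most basic form, I am interested in giving each word a frequency weight based on a meta variable. – FilipW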
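
For this most basic case (one weight per respondent, derived from a meta variable), one option is to skip the `weighting` argument altogether and sum the respondent weights per word. According to the tm documentation, `weighting` expects a weighting function such as `weightTf` or `weightTfIdf` rather than a numeric vector, which would explain the `invalid weighting` error. The sketch below uses base R only. The hand-computed weights stand in for `weights(rake_data)`, under the assumption that raking on the single gender margin gives every respondent the weight population count / sample count for their gender.

```r
# Toy sample mirroring the question (one word per respondent)
set.seed(123)
data <- rbind(
  data.frame(gender = "M",
             words = sample(c("education", "money", "family", "house", "debts"),
                            600, replace = TRUE)),
  data.frame(gender = "F",
             words = sample(c("career", "bank", "friends", "drinks", "relax"),
                            400, replace = TRUE))
)

# Stand-in for weights(rake_data): population count / sample count per gender
# (assumption: equivalent to raking on the single gender margin)
g <- as.character(data$gender)
pop_counts    <- c(M = 490, F = 510)                 # 49% / 51% of 1000
sample_counts <- c(M = sum(g == "M"), F = sum(g == "F"))
data$weights  <- unname(pop_counts[g] / sample_counts[g])

# Weighted term frequency: sum of respondent weights per word
weighted_freq <- tapply(data$weights, data$words, sum)

# Compare with the unweighted counts
comparison <- data.frame(
  word       = names(weighted_freq),
  unweighted = as.numeric(table(data$words)[names(weighted_freq)]),
  weighted   = round(as.numeric(weighted_freq), 1)
)
print(comparison)
```

The weighted frequencies can then be handed to the plotting call, for example `wordcloud(names(weighted_freq), as.numeric(weighted_freq))`. With real text answers (several words per respondent), the same idea should carry over by building the matrix with one document per respondent and multiplying each document's counts by that respondent's weight before summing across documents.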
