Text preprocessing and topic modelling using the text2vec package

I have a large number of documents and want to do topic modelling using text2vec and LDA (Gibbs Sampling).

The steps I need are (in order):

1. Remove numbers and symbols from the text

library(stringr) 
docs$text <- stringr::str_replace_all(docs$text,"[^[:alpha:]]", " ") 
docs$text <- stringr::str_replace_all(docs$text,"\\s+", " ")
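For illustration, here is a minimal base R sketch of the same cleanup on a made-up string. The helper name clean_text is hypothetical and not part of the original post; gsub() stands in for stringr::str_replace_all(), and the trimws() call is an addition, since the two replacements alone can leave a leading or trailing space.

```r
# Hypothetical helper: keep letters only, then normalise whitespace.
clean_text <- function(x) {
  x <- gsub("[^[:alpha:]]", " ", x)  # replace every non-letter with a space
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace into one space
  trimws(x)                          # added here: drop leading/trailing spaces
}

cleaned <- clean_text("Coffee #1: 2 cups, 100% hot!")
print(cleaned)
# "Coffee cups hot"
```

Note that what counts as a letter under [:alpha:] depends on the locale, so accented characters may be handled differently on different systems.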

2. Remove stop words

library(text2vec)   
library(tm) 

stopwords <- c(tm::stopwords("english"),custom_stopwords) 

prep_fun <- tolower 
tok_fun <- word_tokenizer 
tokens <- docs$text%>% 
     prep_fun %>% 
     tok_fun 
it <- itoken(tokens, 
      ids = docs$id, 
      progressbar = FALSE) 

v <- create_vocabulary(it, stopwords = stopwords) %>% 
    prune_vocabulary(term_count_min = 10) 

vectorizer <- vocab_vectorizer(v)

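3. Replace terms with their synonyms

I have an Excel file in which the first column holds the main word and the synonyms are listed in the second, third, ... columns. I want to replace every synonym with its main word (column 1). Each term can have a different number of synonyms. Below is an example of code that does this with the "tm" package (but I am interested in one that uses the text2vec package):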

replaceSynonyms <- content_transformer(function(x, syn=NULL) 
     {Reduce(function(a,b) { 
     gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word,  a, perl = TRUE)}, syn, x) }) 

l <- lapply(as.data.frame(t(Synonyms), stringsAsFactors = FALSE), # 
      function(x) { 
      x <- unname(x) 
      list(word = x[1], syns = x[-1]) 
      }) 
names(l) <- paste0("list", Synonyms[, 1]) 
list2env(l, envir = .GlobalEnv) 

synonyms <- list()   
for (i in 1:length(names(l))) synonyms[i] = l[i] 

MyCorpus <- tm_map(MyCorpus, replaceSynonyms, synonyms)

4. Convert to a document-term matrix

dtm <- create_dtm(it, vectorizer)

5. Apply an LDA model to the document-term matrix

doc_topic_prior <- 0.1 # can be chosen based on data? 
lda_model <- LDA$new(n_topics = 10, 
      doc_topic_prior = doc_topic_prior, topic_word_prior = 0.01) 
doc_topic_distr <- lda_model$fit_transform(dtm, n_iter = 1000, 
      convergence_tol = 0.01, check_convergence_every_n = 10)

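MyCorpus in step 3 is a corpus obtained with the "tm" package. Steps 2 and 3 do not work together, because the output of step 2 is a vocabulary, whereas the input of step 3 is a "tm" corpus.

My first question is how I can do all of these steps with the text2vec package (and compatible packages), since I have found it very efficient; thanks to Dmitriy Selivanov.

Second: how do I set optimal values for the LDA parameters in step 5? Can they be set automatically based on the data?

Thanks to Manuel Bickel for the corrections to my post.

Thanks, Sam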

Source

2017-10-20 Sam S

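Answer updated in response to your comment:

First question: the synonym replacement problem has already been answered here: Replace words in text2vec efficiently. Please check the answer by count. Patterns and replacements may be ngrams (multi-word phrases). Note that the second answer, by Dmitriy Selivanov, uses word_tokenizer() and does not cover ngram replacement in the form presented.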

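Is there any reason why you need to replace synonyms before stop word removal? Usually this order should not cause problems; or do you have an example in which switching the order produces significantly different results? If you really want to replace synonyms after stop word removal, I guess that, when using text2vec, you would have to apply such changes to the dtm. If you do so, you need to allow ngrams in your dtm, with a minimum ngram length that covers the ngrams in your synonyms. I have provided a workaround for this in the code below as one option. Please note that allowing higher-order ngrams in your dtm produces noise, which may or may not affect your downstream tasks (you can probably prune most of the noise in the vocabulary step). Therefore, replacing ngrams beforehand seems to be the better solution.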
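Separately from the dtm workaround, the order "stop words first, synonyms second" can also be sketched at the token level in base R, before anything is handed to text2vec. This is only an illustration under assumptions: the helper name replace_after_stopwords and the small stop word list are hypothetical, and the synonym phrases must already be written without stop words (here "boiling lava" rather than "boiling like lava"), since those words are gone by the time the replacement runs.

```r
# Hypothetical helper: drop stop words, then map synonym phrases to main words.
replace_after_stopwords <- function(docs, stop_words, synonyms) {
  tokens <- strsplit(tolower(docs), "[^[:alpha:]]+")
  tokens <- lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% stop_words)])
  txt <- vapply(tokens, paste, character(1), collapse = " ")
  # Replace longer phrases first so a short synonym cannot break up a longer one.
  synonyms <- synonyms[order(-nchar(synonyms$synonym)), ]
  for (i in seq_len(nrow(synonyms))) {
    pattern <- paste0("\\b", synonyms$synonym[i], "\\b")
    txt <- gsub(pattern, synonyms$mainword[i], txt, perl = TRUE)
  }
  strsplit(txt, " ", fixed = TRUE)  # back to one token vector per document
}

docs <- c("the coffee is boiling like lava",
          "the coffee is frozen",
          "the coffee is warm almost hot")
stop_words <- c("the", "is", "like")  # illustrative list only
synonyms <- data.frame(mainword = c("warm", "cold", "warm"),
                       synonym = c("hot", "frozen", "boiling lava"),
                       stringsAsFactors = FALSE)

tokens <- replace_after_stopwords(docs, stop_words, synonyms)
print(tokens)
```

The result is a list of token vectors, the same shape as the tokens object that the question passes to itoken(), so the vocabulary and dtm steps should be able to follow unchanged. The synonym strings are used as regular expressions, so phrases containing special characters would need escaping.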

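Second question: you might check the textmineR package, which helps you choose the optimal number of topics, or also the answer to this question: Topic models: cross validation with loglikelihood or perplexity. Regarding the handling of priors, I have not yet figured out in detail how the different packages, e.g. text2vec (WarpLDA algorithm), lda (Collapsed Gibbs Sampling algorithm and others), or topicmodels ('standard' Gibbs sampling and Variational Expectation-Maximization algorithm), handle these values. As a starting point, you might look at the detailed documentation of topicmodels; chapter 2.2 "Estimation" tells you how the alpha and beta parameters defined in "2.1 Model specification" are estimated.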
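To make the perplexity criterion from the linked question concrete, here is a minimal base R sketch of the measure itself. It assumes you already have a fitted model's document-topic matrix (theta) and topic-word matrix (phi) with rows summing to 1; the function name perplexity_sketch and the toy numbers are hypothetical, and dense matrices are used only to keep the example short.

```r
# Hypothetical helper. counts: documents x terms, theta: documents x topics,
# phi: topics x terms; the rows of theta and phi are probability distributions.
perplexity_sketch <- function(counts, theta, phi) {
  p <- theta %*% phi                 # p[d, w] = sum_k theta[d, k] * phi[k, w]
  nz <- counts > 0                   # only observed terms contribute
  loglik <- sum(counts[nz] * log(p[nz]))
  exp(-loglik / sum(counts))         # lower values indicate a better fit
}

counts <- matrix(c(2, 0, 1, 1,
                   0, 3, 1, 0), nrow = 2, byrow = TRUE)
theta <- matrix(0.5, nrow = 2, ncol = 2)   # uniform over 2 topics
phi <- matrix(0.25, nrow = 2, ncol = 4)    # uniform over 4 terms

pp <- perplexity_sketch(counts, theta, phi)
print(pp)
# 4
```

With uniform distributions the perplexity equals the vocabulary size (4 here), which is a useful sanity check. In practice one would compute the value on held-out documents for several candidate numbers of topics or prior values and compare them.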

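For learning purposes, please note that your code produced errors at two points, which I have revised: (1) you need to use the correct name for the stop word argument in create_vocabulary(), i.e. stopwords instead of stop_words, since that is the name you defined; (2) you do not need vocabulary =... in your LDA model definition. Perhaps you are using an older version of text2vec?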

library(text2vec) 
library(reshape2) 
library(stringi) 

#function proposed by @count 
mgsub <- function(pattern,replacement,x) { 
    if (length(pattern) != length(replacement)){ 
    stop("pattern and replacement must have the same length") 
    } 
    for (v in 1:length(pattern)) { 
    x <- gsub(pattern[v],replacement[v],x, perl = TRUE) 
    } 
    return(x) 
} 

docs <- c("the coffee is warm", 
      "the coffee is cold", 
      "the coffee is hot", 
      "the coffee is boiling like lava", 
      "the coffee is frozen", 
      "the coffee is perfect", 
      "the coffee is warm almost hot" 
) 

synonyms <- data.frame(mainword = c("warm", "cold") 
         ,syn1 = c("hot", "frozen") 
         ,syn2 = c("boiling like lava", "") 
         ,stringsAsFactors = FALSE) 

synonyms[synonyms == ""] <- NA 

synonyms <- reshape2::melt(synonyms 
          ,id.vars = "mainword" 
          ,value.name = "synonym" 
          ,na.rm = TRUE) 

synonyms <- synonyms[, c("mainword", "synonym")] 


prep_fun <- tolower 
tok_fun <- word_tokenizer 
tokens <- docs %>% 
    #here is where you might replace synonyms directly in the docs 
    #{ mgsub(synonyms[,"synonym"], synonyms[,"mainword"], .) } %>% 
    prep_fun %>% 
    tok_fun 
it <- itoken(tokens, 
      progressbar = FALSE) 

v <- create_vocabulary(it, 
         sep_ngram = "_", 
         ngram = c(ngram_min = 1L 
           #allow for ngrams in dtm 
           ,ngram_max = max(stri_count_fixed(unlist(synonyms), " ")) 
           ) 
) 

vectorizer <- vocab_vectorizer(v) 
dtm <- create_dtm(it, vectorizer) 

#ngrams in dtm 
colnames(dtm) 

#ensure that ngrams in synonym replacement table have the same format as ngrams in dtm 
synonyms <- apply(synonyms, 2, function(x) gsub(" ", "_", x)) 

colnames(dtm) <- mgsub(synonyms[,"synonym"], synonyms[,"mainword"], colnames(dtm)) 


#only zeros/ones in dtm since none of the docs specified in my example 
#contains duplicate terms 
dim(dtm) 
#7 24 
max(dtm) 
#1 

#workaround to aggregate colnames in dtm 
#I think there is no function `colsum` that allows grouping 
#therefore, a workaround based on rowsum 
#not elegant because you have to transpose two times, 
#convert to matrix and reconvert to sparse matrix 
dtm <- 
    Matrix::Matrix(
    t(
     rowsum(t(as.matrix(dtm)), group = colnames(dtm)) 
    ) 
    , sparse = TRUE) 


#synonyms in columns replaced 
dim(dtm) 
#7 20 
max(dtm) 
#2
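The rowsum workaround above converts the dtm to a dense matrix, which may become expensive for a large corpus. As a hedged alternative, the following sketch keeps everything sparse by multiplying with an indicator matrix built by Matrix::sparseMatrix(). The helper name sum_duplicate_columns and the toy matrix are hypothetical; the toy matrix merely stands in for a dtm whose column names contain duplicates after synonym replacement.

```r
library(Matrix)

# Hypothetical helper: sum the columns of a sparse matrix that share a name.
sum_duplicate_columns <- function(m) {
  g <- factor(colnames(m))
  # indicator[j, k] is 1 if column j of m belongs to the k-th unique name
  indicator <- sparseMatrix(i = seq_along(g), j = as.integer(g), x = 1,
                            dims = c(length(g), nlevels(g)),
                            dimnames = list(NULL, levels(g)))
  m %*% indicator
}

# Toy stand-in for a dtm: 3 documents, the name "warm" appears twice.
m <- sparseMatrix(i = c(1, 3, 2, 1, 2), j = c(1, 1, 2, 3, 3), x = 1,
                  dims = c(3, 3),
                  dimnames = list(NULL, c("warm", "cold", "warm")))

agg <- sum_duplicate_columns(m)
print(colnames(agg))
print(as.matrix(agg))
```

As with rowsum, the merged columns come out in alphabetical order of their names, so the column order of the result may differ from the original dtm.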

Source

2017-10-20 12:51:54

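Thank you very much for your answer. Actually, my data contain a large number of misspellings and abbreviations, including different abbreviations of the same word. The main word is always a single word, but a synonym can be a phrase such as "hot water". I need to remove stop words first (step 2 in my question) and then replace the multiple synonyms with the main word. How can I do these two steps in that order, i.e. remove stop words first and then replace synonyms? I have done all of this with the "tm" and "topicmodels" packages, but they are very slow and I would like to switch to text2vec.

I realized that part of your question has already been answered elsewhere. I have updated my answer accordingly and included a link to that answer.

Thanks for the update, Manuel. Removing some stop words before building ngrams makes it easier for me to focus on the important ngrams/phrases. For example, the different variants of a phrase such as "return to work" are all replaced with a single main word. I have many phrases of this type.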
