I want to analyze a big (n = 500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of relying on the automated way with dfm(). I have reasons for this: in one case I don't want to tokenize before removing the stopwords, as this would result in many useless bigrams; in another I have to preprocess the text with language-specific procedures. To create the dfm step by step with quanteda, I would like this sequence to be implemented:
1) remove punctuation and numbers
2) remove stopwords (i.e. before tokenization, to avoid useless tokens)
3) tokenize into unigrams and bigrams
4) create the dfm
My attempt:
> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))
> class(text.corpus)
[1] "corpus" "list"
> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") :
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"
# This is how I would theoretically continue:
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))
Bonus question How do I remove sparse tokens in quanteda? (i.e. the equivalent of removeSparseTerms() in tm)
UPDATE In the light of @Ken's answer, here is the code to proceed step by step with quanteda:
library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’
1) Remove custom punctuation and numbers. E.g. notice the recurring "\n" in the ie2010 corpus:
text.corpus <- ie2010Corpus
texts(text.corpus)[1] # use texts() to extract the text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is
texts(text.corpus)[1] <- gsub("\\s", " ", text.corpus[1]) # replace all whitespace (incl. \n, \t, \r) with a single space
texts(text.corpus)[1]
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
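The same normalization can be checked on a plain string with base R alone (no quanteda involved); every whitespace character, including \n, \t and \r, is replaced by a single space:

```r
# base-R check of the whitespace normalization used above;
# the input string is invented for the illustration
x <- "severe economic\ndistress.\tToday"
gsub("\\s", " ", x)  # "severe economic distress. Today"
```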
A further note on the reasons why one may prefer to preprocess. My present corpus is in Italian, a language whose articles are attached to the following word with an apostrophe. Thus, the straight dfm() can lead to inexact tokenization. For example:
broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi"), removePunct=TRUE))
will produce two separate tokens for the same word ("un'abile" and "L'abile"), hence the need for an additional step with gsub() here.
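A minimal sketch of that additional step, using only base R (replacing the apostrophe with a space is my choice here, not a quanteda requirement): detaching the article means both variants reduce to the same token "abile" once tokenized:

```r
# split apostrophe-attached Italian articles off the following word
it <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"
it.clean <- gsub("'", " ", it)
it.clean  # "L abile presidente Renzi. Un abile mossa di Berlusconi"
```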
2) In quanteda it is not possible to remove stopwords directly from the text before tokenization. In my previous example "l" and "un" have to be removed so as not to produce misleading bigrams. This can be handled in tm with tm_map(..., removeWords).
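Lacking that, a workaround in base R (my own sketch, not a quanteda API; the stopword list is illustrative) is a word-boundary regex over the raw text, removing the detached articles before tokenization so that no misleading bigrams arise:

```r
stopw <- c("l", "un", "la", "il")                       # illustrative stopword list
pattern <- paste0("\\b(", paste(stopw, collapse = "|"), ")\\b")
txt <- "L abile presidente Renzi. Un abile mossa di Berlusconi"
txt <- gsub(pattern, "", txt, ignore.case = TRUE)       # drop the stopwords
txt <- trimws(gsub("\\s+", " ", txt))                   # collapse the gaps left behind
txt  # "abile presidente Renzi. abile mossa di Berlusconi"
```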
3) Tokenization:
token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)
4) Create the dfm:
dfm <- dfm(token)
5) Remove sparse features:
dfm <- trim(dfm, minCount = 5)
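For intuition, here is what that trimming does, illustrated on a plain base-R count matrix standing in for the dfm (the feature names are invented for the example): features whose total count falls below 5 are dropped, which is what trim(dfm, minCount = 5) achieves. Note that tm's removeSparseTerms() thresholds on the share of documents a term appears in rather than on total counts; if I recall the 0.9.x API correctly, trim() also accepts a minDoc argument for that document-frequency behaviour.

```r
# toy 3-documents-by-3-features count matrix standing in for a dfm
m <- matrix(c(6, 1, 0,
              3, 2, 0,
              0, 1, 9),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("economy", "rare", "budget")))
keep <- colSums(m) >= 5              # total count of each feature
m.trimmed <- m[, keep, drop = FALSE]
colnames(m.trimmed)                  # "economy" "budget" -- "rare" (total 4) is gone
```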
To summarize the answer, one can proceed step by step in quanteda using the texts() function. – 000andy8484