2011-12-23 177 views
3

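I created a wordcloud with the wordcloud package in R, with help from "Word Cloud in R". How do I remove words from the wordcloud?

Building the cloud itself is easy enough, but I want to remove certain words from it. I have the words in a file (actually an Excel file, but I could change that), and I want to exclude all of them; there are a couple of hundred. Any suggestions?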

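    require(XML)
    require(tm)
    require(wordcloud)
    require(RColorBrewer)
    # build a corpus from column 6 of data.merged2
    # (recent versions of tm expect DataframeSource input to have doc_id and text columns)
    ap.corpus <- Corpus(DataframeSource(data.frame(as.character(data.merged2[, 6]))))
    ap.corpus <- tm_map(ap.corpus, removePunctuation)
    # (recent versions of tm expect content_transformer(tolower) here)
    ap.corpus <- tm_map(ap.corpus, tolower)
    ap.corpus <- tm_map(ap.corpus, removeWords, stopwords("english"))
    # count term frequencies and sort them in decreasing order
    ap.tdm <- TermDocumentMatrix(ap.corpus)
    ap.m <- as.matrix(ap.tdm)
    ap.v <- sort(rowSums(ap.m), decreasing = TRUE)
    ap.d <- data.frame(word = names(ap.v), freq = ap.v)
    table(ap.d$freq)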
+4

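Instead of, or in addition to, `stopwords("english")`, add the stop words from the Excel file as well. You can combine the word vectors into a single vector of stop words. Those words will then not appear in the cloud. – 2011-12-23 20:25:14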
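A minimal sketch of what this comment suggests, assuming the tm package is installed. The vector `custom.words` is a hypothetical stand-in for the words read from the Excel file, and `sample.text` is made up for illustration.

```r
library(tm)

# hypothetical custom stop words (stand-in for the column read from Excel)
custom.words <- c("peanut", "cashew", "walnut")

# merge the standard English list and the custom list into one vector
all.stopwords <- unique(c(stopwords("english"), custom.words))

# removeWords() also accepts a plain character vector, which makes it easy to test
sample.text <- "the peanut and the apple or the walnut"
cleaned <- stripWhitespace(removeWords(sample.text, all.stopwords))
print(cleaned)
```

On a corpus, the same combined vector could be passed in a single step, for example `tm_map(ap.corpus, removeWords, all.stopwords)`.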

Answers

3

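@Tyler Rinker has already given the answer (just add another line with removeWords()), but here is some more detail.

Let's say your Excel file is called nuts.xls and has a single column of words like this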

stopwords 
peanut 
cashew 
walnut 
almond 
macadamia 

In R you could then proceed like this

    library(gdata) # package with xls import function
    library(tm)
    # now load the excel file with the custom stoplist, note a few of the arguments here
    # to clean the data by removing spaces that excel seems to insert and prevent it from
    # importing the characters as factors. You can use any args from read.table(), which is
    # handy
    nuts <- read.xls("nuts.xls", header=TRUE, stringsAsFactors=FALSE, strip.white=TRUE)

    # now make some words to build a corpus to test for a two-step stopword removal process...
    words1 <- c("peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but")
    words2 <- c("peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on")
    words3 <- c("peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an")
    words.all <- data.frame(rbind(words1, words2, words3))
    # note: recent versions of tm expect DataframeSource input to have doc_id and text
    # columns; VectorSource(c(words1, words2, words3)) is an alternative there
    words.corpus <- Corpus(DataframeSource(words.all))

    # now remove the standard list of stopwords, like you've already worked out
    words.corpus.nostopwords <- tm_map(words.corpus, removeWords, stopwords("english"))
    # now remove the second set of stopwords, this time your custom set from the excel file,
    # note that it has to be a reference to a character vector containing the custom stopwords
    words.corpus.nostopwords <- tm_map(words.corpus.nostopwords, removeWords, nuts$stopwords)

    # have a look to see if it worked
    inspect(words.corpus.nostopwords)
    A corpus with 3 text documents 

    The metadata consists of 2 tag-value pairs and a data frame 
    Available tags are: 
      create_date creator 
    Available variables in the data frame are: 
      MetaID 

    $words1 
     , , , , apple, pear, orange, lime, mandarin, , , 

    $words2 
     , , , , apple, pear, orange, lime, mandarin, , , 

    $words3 
     , , , , apple, pear, orange, lime, mandarin, , , 

Success! The standard stopwords are gone, and so are the words in the custom list from the Excel file. No doubt there are other ways to do this.

+0

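Thanks, Ben and Tin Man. Some combination of the two solved it for me. I had trouble loading the xls file with gdata because of a masking problem; after that, my problems became Excel itself and the extra spaces in cells that contain more than one word. I appreciate all of it, though! Thanks! – user1108155 2011-12-27 17:12:17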
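For the extra spaces and multi-word cells mentioned in this comment, one possible cleanup in base R is sketched below. The values in `raw.cells` are invented for illustration, and splitting every cell into single words is an assumption about what is wanted, not something the comment confirms.

```r
# hypothetical cell values as they might arrive from a spreadsheet
raw.cells <- c(" peanut ", "cashew  walnut", "almond", "  macadamia nut ")

# trim leading and trailing spaces, then split multi-word cells on whitespace
trimmed <- trimws(raw.cells)
custom.words <- unique(unlist(strsplit(trimmed, "[[:space:]]+")))

# drop any empty strings left over from blank cells
custom.words <- custom.words[nzchar(custom.words)]
print(custom.words)
```

The resulting character vector can be passed to removeWords() in the same way as `nuts$stopwords` above.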

0

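Convert the data you want to build the wordcloud from into a data frame. Create a CSV file containing the words you want to remove and read it in as a data frame as well. You can then do an anti_join between the two.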
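A minimal sketch of this approach, assuming the dplyr package is installed. The data frame `freq` is a hypothetical stand-in for a word/frequency table such as `ap.d`, and the inline CSV text stands in for a stoplist file on disk.

```r
library(dplyr)

# hypothetical word frequencies, standing in for a table like ap.d
freq <- data.frame(word = c("apple", "peanut", "pear", "walnut"),
                   freq = c(10, 7, 5, 2),
                   stringsAsFactors = FALSE)

# words to drop; read.csv(text = ...) stands in for reading a CSV file from disk
stop.df <- read.csv(text = "word\npeanut\nwalnut", stringsAsFactors = FALSE)

# keep only the rows of freq whose word has no match in stop.df
kept <- anti_join(freq, stop.df, by = "word")
print(kept)
```

The filtered `kept$word` and `kept$freq` columns could then be handed to wordcloud() in place of the unfiltered ones.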