2016-11-19

Check whether a word is in a dictionary and get its value from another column

I have these two data sets:

stemmed <- data.frame(
    stem = c('super puper', 'only for you') 
) 


super <- data.frame(
    word = c('super', 'puper', 'you'), 
    weight = c(0.5, 0.1, 0.3) 
) 

I check whether each word is in a dictionary (positive or negative) and count how many of them match. I have this loop:

for (i in 1:nrow(stemmed)){ 
    words = strsplit(as.character(stemmed$stem)," ") 
    stemmed$super[i] <- sum(words[[i]] %in% super$word)/length(words[[i]]) 
} 

(By the way, if you know how to improve this code, please tell me.)
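One possible tidy-up of the proportion loop, as a minimal sketch: split the stems once, then let `vapply` compute the share of dictionary words per stem. The column name `share` is made up here to avoid clashing with the `super` column, and `stringsAsFactors = FALSE` is an assumption so that `strsplit` receives character input directly.

```r
stemmed <- data.frame(stem = c("super puper", "only for you"),
                      stringsAsFactors = FALSE)
super <- data.frame(word = c("super", "puper", "you"),
                    weight = c(0.5, 0.1, 0.3),
                    stringsAsFactors = FALSE)

# Split every stem into its words once, outside any loop
words <- strsplit(stemmed$stem, " ", fixed = TRUE)

# mean() of a logical vector is the proportion of TRUE values,
# i.e. the share of words in each stem that appear in the dictionary
stemmed$share <- vapply(words, function(w) mean(w %in% super$word), numeric(1))
```

`vapply` is used rather than `sapply` so that the result is guaranteed to be one number per stem.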

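Now I want to compute not only the number of matching words but also their weight (the sum of the weights in super$weight for the words that match).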

So I tried something like this inside the loop:

if (words[[i]] %in% super$word) { 
stemmed$super[i] = sum(with super[super$word==words[[i]],], 
         sum(super$weight))} 

I would like to get a data frame like this:

stem    super 
super puper  0.6 
only for you  0.3 

I don't know how to solve this problem...


'colSums(t(sapply(super$word, grepl, stemmed$stem)) * super$weight)' – user20650
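One caveat with the `grepl` one-liner: `grepl` matches substrings, so a dictionary word such as "super" would also hit "superb". A minimal sketch, assuming whole-word matching is what is wanted, wraps each word in `\\b` anchors. The third row ("superb idea") is made up for illustration, and the sketch assumes the dictionary words contain no regex metacharacters.

```r
stemmed <- data.frame(stem = c("super puper", "only for you", "superb idea"),
                      stringsAsFactors = FALSE)
super <- data.frame(word = c("super", "puper", "you"),
                    weight = c(0.5, 0.1, 0.3),
                    stringsAsFactors = FALSE)

# Rows = stems, columns = dictionary words; TRUE where the whole word occurs
hits <- sapply(super$word,
               function(w) grepl(paste0("\\b", w, "\\b"), stemmed$stem))

# Same arithmetic as the comment: transpose so the weights recycle per word,
# then sum each column to get one total per stem
weights <- colSums(t(hits) * super$weight)
```

With plain `grepl(w, stemmed$stem)` the third row would pick up the weight of "super"; with the anchors it gets 0.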


Following your line of thought, 'match' is probably the function you need –

Answers

0

There are many ways to do this. Following your approach, I would wrap it in an sapply:

> final <- stemmed 
> final$super <- sapply(stemmed$stem, function(x) { 
    sum(super$weight[super$word %in% unlist(strsplit(as.character(x), " "))]) 
}) 
> final 
      stem super 
1 super puper 0.6 
2 only for you 0.3 
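Note that `super$word %in% words` asks, for each dictionary word, whether it occurs in the stem, so a word repeated within a stem is counted once. If repeats should count every time (an assumption, the question does not say), a sketch using `match` looks up one weight per token instead. The helper name `weight_sum` is hypothetical.

```r
super <- data.frame(word = c("super", "puper", "you"),
                    weight = c(0.5, 0.1, 0.3),
                    stringsAsFactors = FALSE)

# Hypothetical helper: sum the weight of every token, repeats included
weight_sum <- function(stem) {
  tokens <- strsplit(as.character(stem), " ", fixed = TRUE)[[1]]
  # match() returns NA for tokens missing from the dictionary;
  # indexing with NA yields NA, which na.rm = TRUE then drops
  sum(super$weight[match(tokens, super$word)], na.rm = TRUE)
}

weight_sum("super super you")  # "super" contributes twice here
```

For stems without repeated words, such as the two in the question, this gives the same totals as the `%in%` version.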
0
> data.frame(stem=stemmed$stem, 
     super=sapply(lapply(strsplit(as.character(stemmed$stem), " ") , 
          function(txt) super$word %in% txt), 
        function(idx) sum(super$weight[idx]))) 
      stem super 
1 super puper 0.6 
2 only for you 0.3 
0

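I think I found a solution that works for me, but I use data.tables instead of data.frames. The advantage of this solution is that it does not use apply functions or loops.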

library("data.table") 
library("reshape2") 
stemmed <- data.frame(
    stem = c('super puper', 'only for you') 
) 

super <- data.table(
    word = c('super', 'puper', 'you'), 
    weight = c(0.5, 0.1, 0.3) 
) 


# Step 1: Split the words 
split_words <- strsplit(as.character(stemmed$stem), " ") 
names(split_words) <- stemmed$stem 
# Step 2: melt it to a data.table 
result <- data.table(melt(split_words)) 
setnames(result, names(result), c("word", "stem")) 
# Step 3: Find the weight by merging it with super 
setkey(super, word) 
setkey(result, word) 
word_weights <- super[result] 
# Step 4: Filter the NA weights 
word_weights <- word_weights[!is.na(weight)] 
# Step 5: Now aggregate by stem to find the weight per stem 
final_result <- word_weights[, list(super = sum(weight)), by = stem] 
> final_result 
      stem super 
1: super puper 0.6 
2: only for you 0.3 
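For readers without data.table or reshape2, the same split, join and aggregate steps can be sketched in base R. This is an illustrative variant rather than the answer's own code; the names `long`, `joined` and `totals` are made up here.

```r
stemmed <- data.frame(stem = c("super puper", "only for you"),
                      stringsAsFactors = FALSE)
super <- data.frame(word = c("super", "puper", "you"),
                    weight = c(0.5, 0.1, 0.3),
                    stringsAsFactors = FALSE)

# Step 1: split the words
split_words <- strsplit(stemmed$stem, " ", fixed = TRUE)

# Step 2: one row per (stem, word) pair
long <- data.frame(stem = rep(stemmed$stem, lengths(split_words)),
                   word = unlist(split_words),
                   stringsAsFactors = FALSE)

# Step 3: inner join, which drops words that are not in the dictionary
joined <- merge(long, super, by = "word")

# Step 4: sum the weights per stem (rows come back sorted by stem)
totals <- aggregate(weight ~ stem, data = joined, FUN = sum)
```

As with the data.table version, a stem that contains no dictionary words does not appear in the result at all.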
0

You may want to use match:

stemmed <- data.frame(
    stem = c('super puper', 'only for you') 
) 

super <- data.frame(
    word = c('super', 'puper', 'you'), 
    weight = c(0.5, 0.1, 0.3) 
) 

# this line can be moved out of the loop
words <- strsplit(as.character(stemmed$stem)," ") 

for (i in 1:nrow(stemmed)){ 
    stemmed$super[i] <- sum(words[[i]] %in% super$word)/length(words[[i]]) 
    # get weights for super words 
    w.index <- na.exclude(match(words[[i]],super$word)) 
    if (length(w.index) > 0) stemmed$super[i] <- sum(super$weight[w.index]) 

} 

#~ > stemmed 
#~   stem super 
#~ 1 super puper 0.6 
#~ 2 only for you 0.3