2017-03-07 67 views

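Processing unique subsets of data

This is my code for processing observations that are grouped into different associations. For each observation, I want to calculate the Euclidean distance between its description and its association.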

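The for loop splits the data table by group number: each iteration selects a new set of rows to process. The problem is that I want to store the result computed in each iteration. How can I do that?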

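I hope the situation is described clearly; questions are welcome. Suggestions that depart substantially from the current code, or that point to new approaches, are welcome as well!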

Current situation:

 association description group 
1:  zzzz  zzzz  1   
2:  zzzz  efgh  1   
3:  zzzz  hijk  1   
4:  aaaa  lmno  2   
5:  aaaa  pqrs  2   
6:  aaaa  tuvw  2   
7:  aaaa  qyza  2   
8:  aaaa  bcde  2   
9:  bbbb  fqhij  3   
10:  cccc  klmn  4   

Desired solution:

 association description group distance 
1:  zzzz  zzzz  1  1 
2:  zzzz  efgh  1  0 
3:  zzzz  hijk  1  0 
4:  aaaa  lmno  2  0 
5:  aaaa  pqrs  2  0 
6:  aaaa  tuvw  2  0 
7:  aaaa  qyza  2  0 
8:  aaaa  bcde  2  0 
9:  bbbb  fqhij  3  0 
10:  cccc  klmn  4  0 

Libraries

library(data.table)  # provides data.table(), used to build mytable below 
library(tm) 
library(dplyr) 

Function to calculate the distance

euclidean.dist <- function(x1, x2) { 
    sqrt(sum((x1 - x2)^2)) 
} 
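
As a quick sanity check of the helper, here is a minimal sketch that applies it to two small numeric vectors. The input vectors are made up for illustration and are not part of the question's data.

```r
# Same definition as above, repeated so the snippet runs on its own.
euclidean.dist <- function(x1, x2) {
  sqrt(sum((x1 - x2)^2))
}

# 3-4-5 right triangle: the distance from the origin to (3, 4) is 5.
d1 <- euclidean.dist(c(0, 0), c(3, 4))

# Identical vectors are at distance 0.
d2 <- euclidean.dist(c(1, 2, 3), c(1, 2, 3))

print(c(d1, d2))
```

Both arguments must have the same length, which is why the rows of a single document-term matrix are compared later on.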

Data description

association <- c('zzzz', 'zzzz', 'zzzz', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'bbbb', 'cccc') 
description <- c('zzzz', 'efgh', 'hijk', 'lmno', 'pqrs', 'tuvw', 'qyza', 'bcde', 'fqhij', 'klmn') 
group <- c(1,1,1,2,2,2,2,2,3,4) 
distance <- 0 

mytable <- data.table(association,description,group,distance) 

Index for the for loop

ID <- length(unique(mytable$group)) 

To try things out, for now, set:

ID <- 1 
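
Because `ID` holds a single number, it helps to see how a `for` loop treats such a value: iterating over the number itself runs the body once, while iterating over `seq_len(ID)` runs it once per group. A minimal sketch, where the counters `runs_scalar` and `runs_seq` are hypothetical names used only for illustration:

```r
ID <- 4  # e.g. the number of unique groups

# Iterating over the scalar itself: the body runs once, with i equal to 4.
runs_scalar <- 0
for (i in ID) {
  runs_scalar <- runs_scalar + 1
}

# Iterating over seq_len(ID): the body runs for i = 1, 2, 3, 4.
runs_seq <- 0
for (i in seq_len(ID)) {
  runs_seq <- runs_seq + 1
}

print(c(runs_scalar, runs_seq))
```

With `ID <- 1` the two forms happen to behave the same, so the difference only shows up once all groups are processed.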

The for loop itself

for(i in seq_len(ID)) { 

#for each unique group, select only the rows of one group at a time 
#Get only the description column 
    x1 <- filter(mytable, group == unique(mytable$group)[[i]]) %>% 
    select(description) 

#For the same rows, select the specific association of the group of rows 
#Every row in a group shares one association, so the first row is enough 
    x2 <- filter(mytable, group == unique(mytable$group)[[i]]) %>% 
    slice(1) %>% 
    select(association) 

#Rename the association column to description, so as to enable rbind 
    x2 <- rename(x2, description = association) 
    x3 <- rbind(x2, x1) 

#Create distance column to store the values 
    x3$distance <- 0 

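#Transform to a corpus to weight the terms in each doc 
#VectorSource treats each description as one document 
#(DataframeSource in newer tm versions expects doc_id and text columns) 
    mycorpus <- VCorpus(VectorSource(x3$description)) 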
    dtm <- DocumentTermMatrix(mycorpus, 
         control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), 
             stopwords = FALSE)) 

#Create a matrix for measure 
    x4 <- as.matrix(dtm) 

#Get all rows, except the first row 
#The first row serves as input to calculate the euclidean for each row 
    rows <- (seq(1, nrow(x3) -1) +1) 

#Calculate for all rows the distance 
#Leave the first row empty, as it could be removed 
    for(a in rows) { 
    x3$distance[a] <- euclidean.dist(x4[1,], x4[a,]) 
    } 
} 

Answer


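The following replaces the for loop with lapply. Personally, I prefer the *apply family of functions in R, because it is clear what they return, whereas with a for loop that is not always so clear.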

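We are doing essentially the same work: applying the same series of functions to each element of ID. Note that the inner for loop has also been replaced with an *apply call.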

lapply(1:ID, function(i) { 

    x1 <- filter(mytable, group == unique(mytable$group)[[i]]) %>% 
    select(description) 

    # Every row in a group shares one association, so take the first row only 
    x2 <- filter(mytable, group == unique(mytable$group)[[i]]) %>% 
    slice(1) %>% 
    select(association) 

    x2 <- rename(x2, description = association) 
    x3 <- rbind(x2, x1) 

    x3$distance <- 0 

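    # One document per description (newer tm versions require doc_id/text columns for DataframeSource) 
    mycorpus <- VCorpus(VectorSource(x3$description)) 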
    dtm <- DocumentTermMatrix(mycorpus, 
          control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), 
              stopwords = FALSE)) 

    x4 <- as.matrix(dtm) 

    rows <- (seq(1, nrow(x3) -1) + 1) 

    # Fill rows 2..n with the distance to row 1 (the association) 
    x3$distance[rows] <- vapply(rows, function(a) { 
    euclidean.dist(x4[1, ], x4[a, ]) 
    }, numeric(1)) 

    x3 %>% mutate(group = i) 

}) %>% 
    do.call(what = bind_rows)
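
If a for loop is preferred, the same idea of storing each iteration can be sketched with a preallocated list that is bound together at the end. This is only a rough base R illustration, not the code from the question: to stay self-contained it assumes plain 0/1 term indicators in place of the tf-idf weighted document-term matrix, and the names `groups`, `results`, `vocab`, `ref` and `out` are hypothetical.

```r
# Toy data mirroring the table in the question.
mytable <- data.frame(
  association = c("zzzz", "zzzz", "zzzz", "aaaa", "aaaa",
                  "aaaa", "aaaa", "aaaa", "bbbb", "cccc"),
  description = c("zzzz", "efgh", "hijk", "lmno", "pqrs",
                  "tuvw", "qyza", "bcde", "fqhij", "klmn"),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4),
  stringsAsFactors = FALSE
)

euclidean.dist <- function(x1, x2) sqrt(sum((x1 - x2)^2))

groups <- unique(mytable$group)
results <- vector("list", length(groups))  # one slot per iteration

for (i in seq_along(groups)) {
  rows <- mytable[mytable$group == groups[i], ]

  # ASSUMPTION: 0/1 term indicators stand in for the tf-idf matrix.
  vocab <- unique(c(rows$association, rows$description))
  ref <- as.numeric(vocab == rows$association[1])

  # Distance between the association vector and each description vector.
  rows$distance <- vapply(
    rows$description,
    function(d) euclidean.dist(ref, as.numeric(vocab == d)),
    numeric(1),
    USE.NAMES = FALSE
  )

  results[[i]] <- rows  # store this iteration's result
}

# Combine the per-group pieces into a single data frame.
out <- do.call(rbind, results)
print(out)
```

Under this simplification a description that equals its association gets distance 0 and any other description gets sqrt(2); the tf-idf weighted version would generally produce different values.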