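This is my code for processing data observations that are grouped into different associations. For each observation, I want to calculate the Euclidean distance between its description and its association.
To process each unique subset of the data, a for loop subsets the data table by group number. Each iteration of the for loop selects a new set of rows to process. The problem is that I want to store the calculations from every iteration. How can I do this?
I hope the situation is described clearly; questions are welcome. Suggestions that deviate substantially from the current code, or entirely new approaches, are also welcome!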
Current state:
association description group
1: zzzz zzzz 1
2: zzzz efgh 1
3: zzzz hijk 1
4: aaaa lmno 2
5: aaaa pqrs 2
6: aaaa tuvw 2
7: aaaa qyza 2
8: aaaa bcde 2
9: bbbb fqhij 3
10: cccc klmn 4
Desired result:
association description group distance
1: zzzz zzzz 1 1
2: zzzz efgh 1 0
3: zzzz hijk 1 0
4: aaaa lmno 2 0
5: aaaa pqrs 2 0
6: aaaa tuvw 2 0
7: aaaa qyza 2 0
8: aaaa bcde 2 0
9: bbbb fqhij 3 0
10: cccc klmn 4 0
Libraries
library(tm)
library(dplyr)
library(data.table)  # needed for data.table() below
Function to calculate the distance
euclidean.dist <- function(x1, x2) {
  sqrt(sum((x1 - x2)^2))
}
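As a quick sanity check of the helper above, here is a minimal example on two short numeric vectors. The input values are made up for illustration and are not taken from the data in this question.

```r
# Euclidean distance helper, repeated so the snippet runs on its own
euclidean.dist <- function(x1, x2) {
  sqrt(sum((x1 - x2)^2))
}

# Illustrative inputs (assumed, not part of the original data)
p <- c(0, 0)
q <- c(3, 4)

d <- euclidean.dist(p, q)  # sqrt(3^2 + 4^2)
print(d)                   # 5
```

Both arguments need to be numeric vectors of the same length, which is why the loop further down compares rows of the same document-term matrix.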
Example data
association <- c('zzzz', 'zzzz', 'zzzz', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'bbbb', 'cccc')
description <- c('zzzz', 'efgh', 'hijk', 'lmno', 'pqrs', 'tuvw', 'qyza', 'bcde', 'fqhij', 'klmn')
group <- c(1,1,1,2,2,2,2,2,3,4)
distance <- 0
mytable <- data.table(association,description,group,distance)
Index for the for loop
ID <- seq_along(unique(mytable$group))
For testing purposes, for now, set:
ID <- 1
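One detail worth keeping in mind when building the loop index: `for (i in n)` with a single number `n` runs exactly once, with `i` equal to `n`. To visit every group, the loop has to run over a sequence instead. A small sketch of the difference, using an illustrative `n_groups` variable that is not part of the original code:

```r
n_groups <- 4  # illustrative value, matching the four groups in the example data

# Looping over a single number visits only that number
visited_scalar <- c()
for (i in n_groups) {
  visited_scalar <- c(visited_scalar, i)
}

# Looping over seq_len() visits 1, 2, ..., n_groups
visited_seq <- c()
for (i in seq_len(n_groups)) {
  visited_seq <- c(visited_seq, i)
}

print(visited_scalar)  # 4
print(visited_seq)     # 1 2 3 4
```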
The for loop itself
for (i in ID) {
  # Group value handled in this iteration
  g <- unique(mytable$group)[[i]]
  # For each unique group, select only the rows of one group at a time
  # Get only the description column
  x1 <- filter(mytable, group == g) %>%
    select(description)
  # For the same rows, select the association of the group
  # (first row only; the association is constant within a group in the example data)
  x2 <- filter(mytable, group == g) %>%
    slice(1) %>%
    select(association)
  # Rename the association column to description, so as to enable rbind
  x2 <- rename(x2, description = association)
  x3 <- rbind(x2, x1)
  # Create distance column to store the values
  x3$distance <- 0
  # Transform to a corpus to weight the terms in each doc
  mycorpus <- Corpus(DataframeSource(x3))
  dtm <- DocumentTermMatrix(mycorpus,
                            control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                           stopwords = FALSE))
  # Create a matrix for measure
  x4 <- as.matrix(dtm)
  # Get all rows, except the first row
  # The first row serves as input to calculate the euclidean for each row
  rows <- seq(2, nrow(x3))
  # Calculate for all rows the distance
  # Leave the first row empty, as it could be removed
  for (a in rows) {
    # Index by a (the row), not by i (the group)
    x3$distance[a] <- euclidean.dist(x4[1, ], x4[a, ])
  }
}
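One common way to keep the result of every iteration is to pre-allocate a list, write each group's result into its own slot, and bind the pieces together after the loop. The sketch below uses base R only and is not the tf-idf pipeline from the question: `letter_counts` is a hypothetical stand-in that turns a string into a vector of letter frequencies, so the distances it produces will differ from those of the `tm`-based code. The data is rebuilt as a plain data frame so the snippet runs on its own.

```r
euclidean.dist <- function(x1, x2) {
  sqrt(sum((x1 - x2)^2))
}

# Hypothetical stand-in for the tf-idf weighting: counts of a..z in a string
letter_counts <- function(s) {
  chars <- strsplit(s, "")[[1]]
  as.numeric(table(factor(chars, levels = letters)))
}

mytable <- data.frame(
  association = c("zzzz", "zzzz", "zzzz", "aaaa", "aaaa",
                  "aaaa", "aaaa", "aaaa", "bbbb", "cccc"),
  description = c("zzzz", "efgh", "hijk", "lmno", "pqrs",
                  "tuvw", "qyza", "bcde", "fqhij", "klmn"),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4),
  stringsAsFactors = FALSE
)

groups <- unique(mytable$group)
results <- vector("list", length(groups))  # one slot per iteration

for (i in seq_along(groups)) {
  # Rows belonging to the current group
  sub <- mytable[mytable$group == groups[i], ]
  # Reference vector: the group's association (first row of the subset)
  ref <- letter_counts(sub$association[1])
  # Distance of every description in the group to the reference
  sub$distance <- vapply(
    sub$description,
    function(d) euclidean.dist(ref, letter_counts(d)),
    numeric(1),
    USE.NAMES = FALSE
  )
  results[[i]] <- sub  # store this iteration's result
}

# Combine the stored pieces into one table
final <- do.call(rbind, results)
print(final)
```

Calling `rbind` inside the loop to grow the table would also work, but it copies the accumulated table on every iteration; filling a list and binding once at the end is usually the cheaper pattern.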