2016-06-14
0

For loop inside dplyr mutate

I would like a more elegant way to apply mutate across a series of columns: my table has more than 200 columns that I want to transform with mutate.

Here is an example.

Sample data:

df <- data.frame(treatment=rep(letters[1:2],10), 
c1_x=rnorm(20),c2_y=rnorm(20),c3_z=rnorm(20), 
c4_x=rnorm(20),c5_y=rnorm(20),c6_z=rnorm(20), 
c7_x=rnorm(20),c8_y=rnorm(20),c9_z=rnorm(20), 
c10_x=rnorm(20),c11_y=rnorm(20),c12_z=rnorm(20), 
c_n=rnorm(20)) 

Sample code:

library(dplyr)  # provides %>% and mutate()
dfm <- df %>%
    mutate(cx = (c1_x*c4_x/c_n + c7_x*c10_x/c_n),
           cy = (c2_y*c5_y/c_n + c8_y*c11_y/c_n),
           cz = (c3_z*c6_z/c_n + c9_z*c12_z/c_n))
+2

"Do the following things" is quite vague and requires the reader to wade through your code, you know. You could describe it in words. – Frank

+1

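You should melt the data so that you can do grouped operations on the 'x', 'y', 'z' groups. (Actually, judging from your example, it may come down to straight column arithmetic after melting.) – Gregor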

+2

Agreed with @Gregor; you could also use 'tidyr::gather()' (hadleyverse 2) instead of 'reshape2::melt()' (hadleyverse 1) –

Answers

3

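Although they were made in passing, the initial suggestions to use tidyr functions point where you need to go. The pipeline below seems to do the job, based on what you have provided.

Your data: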

df <- data.frame(treatment=rep(letters[1:2],10), 
       c1_x=rnorm(20), c2_y=rnorm(20), c3_z=rnorm(20), 
       c4_x=rnorm(20), c5_y=rnorm(20), c6_z=rnorm(20), 
       c7_x=rnorm(20), c8_y=rnorm(20), c9_z=rnorm(20), 
       c10_x=rnorm(20), c11_y=rnorm(20), c12_z=rnorm(20), 
       c_n=rnorm(20)) 
library(dplyr) 
library(tidyr) 

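This first helper data.frame is used to translate your c#_[xyz] variables into a unified set of names. I'm sure there are other ways to handle this, but it works, is relatively easy to reproduce, and can be extended to your 200+ columns.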

variableTransform <- tibble(  # tibble() replaces the deprecated data_frame()
    cnum = paste0("c", 1:12), 
    cvar = rep(paste0("a", 1:4), each = 3) 
) 
head(variableTransform) 
# Source: local data frame [6 x 2] 
# cnum cvar 
# <chr> <chr> 
# 1 c1 a1 
# 2 c2 a1 
# 3 c3 a1 
# 4 c4 a2 
# 5 c5 a2 
# 6 c6 a2 
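
With 200+ columns, the lookup table could also be derived from the column names rather than typed out. The following is a minimal base R sketch, not part of the original answer. The `cols` vector is a hypothetical stand-in for the real column names, and the grouping rule (every consecutive block of three columns forms one group) is an assumption taken from the 12-column example.

```r
# Hypothetical column names following the question's c<number>_<suffix> scheme
cols <- c("c1_x", "c2_y", "c3_z", "c4_x", "c5_y", "c6_z",
          "c7_x", "c8_y", "c9_z", "c10_x", "c11_y", "c12_z")

# Strip the suffix to get the join key, then pull out the running number
cnum <- sub("_.*$", "", cols)
idx  <- as.integer(sub("^c", "", cnum))

# Assumption: every consecutive block of three columns forms one group
variableTransform <- data.frame(
  cnum = cnum,
  cvar = paste0("a", (idx - 1) %/% 3 + 1),
  stringsAsFactors = FALSE
)
```

On real data, `cols` would presumably come from something like `grep("^c[0-9]+_", names(df), value = TRUE)`.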

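Here is the whole pipeline in one go. I'll explain the steps in a moment. What you are looking for is probably the combination of the treatment, xyz, and ans columns.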

df %>% 
    tidyr::gather(cnum, value, -treatment, -c_n) %>% 
    tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>% 
    left_join(variableTransform, by = "cnum") %>% 
    select(-cnum) %>% 
    tidyr::spread(cvar, value) %>% 
    mutate(
        ans = a1 * (a2/c_n) + a3 * (a4/c_n)
    ) %>%
    head 
# treatment  c_n xyz   a1   a2   a3   a4   ans 
# 1   a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419 1.15801448 
# 2   a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979 -0.01828831 
# 3   a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878 -2.03197283 
# 4   a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493 0.15759418 
# 5   a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839 0.65270681 
# 6   a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095 0.06136036 
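
For readers on tidyr 1.0 or later, the same idea can be sketched with `pivot_longer()` and `pivot_wider()`, which supersede `gather()` and `spread()`. This is an illustrative variant, not the answer's own code. The `id` column, the seed, and the arithmetic grouping rule (blocks of three columns) are assumptions added here; `id` keeps rows distinct without relying on `c_n` being unique.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # seed chosen only to make the sketch reproducible
df <- data.frame(treatment = rep(letters[1:2], 10),
                 c1_x = rnorm(20), c2_y = rnorm(20), c3_z = rnorm(20),
                 c4_x = rnorm(20), c5_y = rnorm(20), c6_z = rnorm(20),
                 c7_x = rnorm(20), c8_y = rnorm(20), c9_z = rnorm(20),
                 c10_x = rnorm(20), c11_y = rnorm(20), c12_z = rnorm(20),
                 c_n = rnorm(20))

res <- df %>%
  mutate(id = row_number()) %>%                 # explicit row key (assumption)
  pivot_longer(cols = matches("^c[0-9]+_[xyz]$"),
               names_to = c("cnum", "xyz"),
               names_pattern = "^c([0-9]+)_([xyz])$") %>%
  # group consecutive triples: c1-c3 -> a1, c4-c6 -> a2, ...
  mutate(cvar = paste0("a", (as.integer(cnum) - 1L) %/% 3L + 1L)) %>%
  select(-cnum) %>%
  pivot_wider(names_from = cvar, values_from = value) %>%
  mutate(ans = a1 * a2 / c_n + a3 * a4 / c_n)

# Compare the "x" rows against the formula written out in the question
cx_direct <- with(df, c1_x * c4_x / c_n + c7_x * c10_x / c_n)
all.equal(res$ans[res$xyz == "x"], cx_direct)
```

Computing the group label arithmetically replaces the join against the helper table; whether that is clearer than the lookup depends on how regular the real column names are.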

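First, we take the original data and collapse all but two of the columns into two columns holding "column name" and "column value" pairs: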

df %>% 
    tidyr::gather(cnum, value, -treatment, -c_n) %>% 
# treatment   c_n cnum  value 
# 1   a 0.20745647 c1_x -0.1250222 
# 2   b 0.01015871 c1_x -0.4585088 
# 3   a 1.65671028 c1_x -0.2455927 
# 4   b -0.24037137 c1_x 0.6219516 
# 5   a -1.16092349 c1_x -0.3716138 
# 6   b 1.61191700 c1_x 1.7605452 

It helps to split c1_x into c1 and x, so that we can translate the former and keep the latter:

tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>% 
# treatment   c_n cnum xyz  value 
# 1   a 0.20745647 c1 x -0.1250222 
# 2   b 0.01015871 c1 x -0.4585088 
# 3   a 1.65671028 c1 x -0.2455927 
# 4   b -0.24037137 c1 x 0.6219516 
# 5   a -1.16092349 c1 x -0.3716138 
# 6   b 1.61191700 c1 x 1.7605452 

From here, let's translate the c1, c2, and c3 variables into a1 (and likewise for the other 9 variables) using variableTransform:

left_join(variableTransform, by = "cnum") %>% 
    select(-cnum) %>% 
# treatment   c_n xyz  value cvar 
# 1   a 0.20745647 x -0.1250222 a1 
# 2   b 0.01015871 x -0.4585088 a1 
# 3   a 1.65671028 x -0.2455927 a1 
# 4   b -0.24037137 x 0.6219516 a1 
# 5   a -1.16092349 x -0.3716138 a1 
# 6   b 1.61191700 x 1.7605452 a1 

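Since we want to operate on several variables at once (with a single, simple mutate), we need to bring some of the variables back into columns. (The reason we gathered and now spread is that it helps me keep things organized and well named. I'm sure someone can come up with another way to do this.)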

tidyr::spread(cvar, value) %>% head 
# treatment  c_n xyz   a1   a2   a3   a4 
# 1   a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419 
# 2   a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979 
# 3   a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878 
# 4   a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493 
# 5   a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839 
# 6   a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095 

From here, we just need the mutate to get the right answer.
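
To get back to the wide layout the question asked for (one cx, cy, cz column per original row), the long result can be spread once more. The sketch below is an assumption-laden illustration: the `id` column, the seed, and the `c` prefix added to the xyz labels are not in the original answer, and the helper table is rebuilt with `data.frame()` to keep the example self-contained.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # seed chosen only to make the sketch reproducible
df <- data.frame(treatment = rep(letters[1:2], 10),
                 c1_x = rnorm(20), c2_y = rnorm(20), c3_z = rnorm(20),
                 c4_x = rnorm(20), c5_y = rnorm(20), c6_z = rnorm(20),
                 c7_x = rnorm(20), c8_y = rnorm(20), c9_z = rnorm(20),
                 c10_x = rnorm(20), c11_y = rnorm(20), c12_z = rnorm(20),
                 c_n = rnorm(20))

variableTransform <- data.frame(cnum = paste0("c", 1:12),
                                cvar = rep(paste0("a", 1:4), each = 3),
                                stringsAsFactors = FALSE)

long <- df %>%
  mutate(id = row_number()) %>%                  # explicit row key (assumption)
  gather(cnum, value, -treatment, -c_n, -id) %>%
  separate(cnum, c("cnum", "xyz"), sep = "_") %>%
  left_join(variableTransform, by = "cnum") %>%
  select(-cnum) %>%
  spread(cvar, value) %>%
  mutate(ans = a1 * (a2 / c_n) + a3 * (a4 / c_n))

# One row per original observation, with columns cx, cy, cz
dfm <- long %>%
  select(id, treatment, xyz, ans) %>%
  mutate(xyz = paste0("c", xyz)) %>%
  spread(xyz, ans) %>%
  arrange(id)

# Should agree with the formula written out by hand in the question
all.equal(dfm$cx, with(df, c1_x * c4_x / c_n + c7_x * c10_x / c_n))
```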

0

Similar to r2evans's answer, but with more manipulation in place of the join (and less explanation).

library(tidyr) 
library(stringr) 
library(dplyr) 

# get it into fully long form
gather(df, key = cc_xyz, value = value, c1_x:c12_z) %>%
    # separate off the xyz and the c123
    separate(col = cc_xyz, into = c("cc", "xyz")) %>%
    # extract the number
    mutate(num = as.numeric(str_replace(cc, pattern = "c", replacement = "")),
      # group consecutive triples (c1-c3, c4-c6, ...) and add a letter so it's a good col name
      num_mod = paste0("v", ceiling(num / 3))) %>%
    # remove unwanted columns
    select(-cc, -num) %>%
    # go into a reasonable data width for calculation
    spread(key = num_mod, value = value) %>%
    # calculate: each product is divided by c_n, as in the question
    mutate(result = v1 * v2/c_n + v3 * v4/c_n)

# treatment   c_n xyz   v1   v2   v3   v4  result
# ...
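
Since the question title asks about a for loop, here is a base R sketch that loops over the suffixes instead of reshaping. It is not taken from either answer. It assumes that, within each suffix, the columns sorted by number form consecutive pairs to be multiplied (c1 with c4, c7 with c10, and so on), which matches the 12-column example but may not hold for the real 200+ columns.

```r
set.seed(7)  # seed chosen only to make the sketch reproducible
df <- data.frame(treatment = rep(letters[1:2], 10),
                 c1_x = rnorm(20), c2_y = rnorm(20), c3_z = rnorm(20),
                 c4_x = rnorm(20), c5_y = rnorm(20), c6_z = rnorm(20),
                 c7_x = rnorm(20), c8_y = rnorm(20), c9_z = rnorm(20),
                 c10_x = rnorm(20), c11_y = rnorm(20), c12_z = rnorm(20),
                 c_n = rnorm(20))

dfm <- df
for (s in c("x", "y", "z")) {
  # all columns ending in this suffix, sorted by their number
  cols <- grep(paste0("_", s, "$"), names(df), value = TRUE)
  cols <- cols[order(as.integer(sub("^c([0-9]+)_.*$", "\\1", cols)))]
  m <- as.matrix(df[cols])
  # assumption: columns 1 and 2 form a pair, columns 3 and 4 the next, ...
  first  <- m[, seq(1, ncol(m), by = 2), drop = FALSE]
  second <- m[, seq(2, ncol(m), by = 2), drop = FALSE]
  # sum of pairwise products, divided by c_n
  dfm[[paste0("c", s)]] <- rowSums(first * second) / df$c_n
}
```

The loop adds cx, cy, and cz to a copy of the data and leaves the original columns in place, so the result has the same shape as the `dfm` in the question.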