2016-06-07

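R: Applying a normalization function column-wise on a large data.frame/data.table

I have a large R data frame with close to 500 columns. I want to apply the existing scale function and also try different normalization functions, column by column.

With the existing scale function: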

library(dplyr) 

set.seed(1234) 
dat <- data.frame(x = rnorm(10, 30, .2), 
        y = runif(10, 3, 5), 
        z = runif(10, 10, 20), k = runif(10, 5, 10)) 

dat %>% mutate_each_(funs(scale),vars=c("y","z")) 

Question 1: In this case vars contains only two columns, but what is the best approach when you have 500 columns to normalize? I tried the following:

dnot <- c("y", "z") 
dat %>% mutate_each_(funs(scale),vars=!(names(dat) %in% dnot)) 

Error:

Error in UseMethod("as.lazy_dots") : 
    no applicable method for 'as.lazy_dots' applied to an object of class "logical" 

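Question 2: Instead of the built-in scale function, I would like to apply my own function to normalize the data frame.

Example: I have the following function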

normalized_columns <- function(x) 
{ 
    r <- (x/sum(x)) 
} 

How can I efficiently apply this to all columns while leaving only 3 or 4 columns untouched?

Answers


There are probably better ways, but I usually do something like:

set.seed(1234) 
x = rnorm(10, 30, .2) 
y = runif(10, 3, 5) 
z = runif(10, 10, 20) 
k = runif(10, 5, 10) 
a = rnorm(10, 30, .2) 
b = runif(10, 3, 5) 
c = runif(10, 10, 20) 
d = runif(10, 5, 10) 

normalized_columns <- function(x) 
{ 
x/sum(x) 
} 

dat<-data.frame(x,y,z,k,a,b,c,d) 
dat[,c(1,4,6:8)]<-sapply(dat[,c(1,4,6:8)], normalized_columns) 
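With around 500 columns, positional indices such as c(1,4,6:8) are easy to get wrong, so selecting by name may be more convenient. The following is a minimal sketch, assuming the columns to leave out are listed in a character vector (called dnot here, as in the question). It uses lapply rather than sapply so the result stays a list of columns; that is a choice made for this sketch, not something the answer prescribes.

```r
normalized_columns <- function(x) {
  x / sum(x)
}

set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20),
                  k = runif(10, 5, 10))

dnot <- c("y", "z")                # columns to leave untouched
cols <- setdiff(names(dat), dnot)  # every other column name

# Replace only the selected columns with their normalized versions
dat[cols] <- lapply(dat[cols], normalized_columns)

colSums(dat[cols])                 # each normalized column should sum to 1
```

The same pattern works with scale or any other function that takes a numeric vector and returns a vector of the same length.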

Edit: As far as efficiency goes, this is quite fast:

set.seed(100) 
dat<-data.frame(matrix(rnorm(50000, 5, 2), nrow = 100, ncol = 500)) 
cols<-sample.int(500, 495, replace = F) 
system.time(dat[,cols]<-sapply(dat[,cols], normalized_columns)) 
##user system elapsed 
##0.03 0.00 0.03 
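If the selected columns are all numeric, the same normalization can also be written as a single matrix operation. The sketch below is only an illustration under that assumption: it compares the column-by-column result with a sweep-based version and makes no claim about which is faster on a given machine.

```r
set.seed(100)
dat <- data.frame(matrix(rnorm(50000, 5, 2), nrow = 100, ncol = 500))
cols <- sample.int(500, 495, replace = FALSE)

normalized_columns <- function(x) x / sum(x)

# Column-by-column version, as in the answer above
by_column <- sapply(dat[, cols], normalized_columns)

# Matrix version: divide every column by its column sum in one step
m <- as.matrix(dat[, cols])
by_matrix <- sweep(m, 2, colSums(m), "/")

max(abs(by_column - by_matrix))  # should be zero or negligibly small
```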

Since the OP is using dplyr methods, one option is to use setdiff with mutate_each_

dat %>% 
    mutate_each_(funs(scale), setdiff(names(dat), dnot)) 
#    x  y  z   k 
#1 -0.8273937 3.633225 14.56091 0.22934964 
#2 0.6633811 3.605387 12.65187 0.76742806 
#3 1.4738069 3.318092 13.04672 -1.16688369 
#4 -1.9708424 3.079992 15.07307 0.62528427 
#5 0.8157183 3.437599 11.81096 -1.06313355 
#6 0.8929749 4.621197 17.59671 -0.06743894 
#7 -0.1923930 4.051395 12.01248 0.94484655 
#8 -0.1641660 4.829316 12.58810 -0.16575678 
#9 -0.1820615 4.662690 19.92150 -1.55940662 
#10 -0.5090247 3.091541 18.07352 1.45571106 

Or subset names(dat) with a logical index

dat %>% 
    mutate_each_(funs(scale), names(dat)[!names(dat) %in% dnot]) 
#   x  y  z   k 
#1 -0.8273937 3.633225 14.56091 0.22934964 
#2 0.6633811 3.605387 12.65187 0.76742806 
#3 1.4738069 3.318092 13.04672 -1.16688369 
#4 -1.9708424 3.079992 15.07307 0.62528427 
#5 0.8157183 3.437599 11.81096 -1.06313355 
#6 0.8929749 4.621197 17.59671 -0.06743894 
#7 -0.1923930 4.051395 12.01248 0.94484655 
#8 -0.1641660 4.829316 12.58810 -0.16575678 
#9 -0.1820615 4.662690 19.92150 -1.55940662 
#10 -0.5090247 3.091541 18.07352 1.45571106 
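The error in the question comes from passing the logical vector itself as vars: the message reports that as.lazy_dots has no method for an object of class "logical". Indexing names(dat) with that logical vector turns it into the character vector of column names that mutate_each_ accepts. A small base R sketch of the difference, using a toy data frame whose values are made up for illustration:

```r
# Toy data frame with the same column names as in the question
dat <- data.frame(x = 1:3, y = 4:6, z = 7:9, k = 10:12)
dnot <- c("y", "z")

keep <- !(names(dat) %in% dnot)
class(keep)        # "logical": one TRUE/FALSE per column, not column names

selected <- names(dat)[keep]
class(selected)    # "character"
selected           # "x" "k"
```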

If we use mutate_each, another option is one_of

dat %>% 
    mutate_each(funs(scale), -one_of(dnot)) 
#   x  y  z   k 
#1 -0.8273937 3.633225 14.56091 0.22934964 
#2 0.6633811 3.605387 12.65187 0.76742806 
#3 1.4738069 3.318092 13.04672 -1.16688369 
#4 -1.9708424 3.079992 15.07307 0.62528427 
#5 0.8157183 3.437599 11.81096 -1.06313355 
#6 0.8929749 4.621197 17.59671 -0.06743894 
#7 -0.1923930 4.051395 12.01248 0.94484655 
#8 -0.1641660 4.829316 12.58810 -0.16575678 
#9 -0.1820615 4.662690 19.92150 -1.55940662 
#10 -0.5090247 3.091541 18.07352 1.45571106 
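In later dplyr releases, mutate_each and funs are deprecated, and the same selection can be expressed with across (available from dplyr 1.0.0). The following is a sketch under the assumption that such a version is installed. Wrapping scale in as.numeric is a choice made here so that each column stays a plain numeric vector rather than a one-column matrix.

```r
library(dplyr)

set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20),
                  k = runif(10, 5, 10))
dnot <- c("y", "z")

# Scale every column except those named in dnot
scaled <- dat %>%
  mutate(across(!all_of(dnot), ~ as.numeric(scale(.x))))

scaled
```

A custom function such as normalized_columns can be passed in place of the scale lambda in the same way.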

The setdiff option with data.table

library(data.table) 
nm1 <- setdiff(names(dat), dnot) 
setDT(dat)[, (nm1) := lapply(.SD, scale), .SDcols = nm1]
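The same data.table pattern should also cover the second question, since any function that maps a numeric vector to a vector of the same length can replace scale. A minimal sketch with the question's normalized_columns function follows. It uses as.data.table to work on a copy, which is an assumption made for this example; setDT as used above converts dat itself by reference.

```r
library(data.table)

normalized_columns <- function(x) x / sum(x)

set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
                  y = runif(10, 3, 5),
                  z = runif(10, 10, 20),
                  k = runif(10, 5, 10))
dnot <- c("y", "z")
nm1 <- setdiff(names(dat), dnot)

dt <- as.data.table(dat)  # a copy, so dat itself is left unchanged

# Overwrite the selected columns in place with their normalized values
dt[, (nm1) := lapply(.SD, normalized_columns), .SDcols = nm1]

dt[, lapply(.SD, sum), .SDcols = nm1]  # each selected column should sum to 1
```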