2013-03-06 64 views
1

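I'm trying to write a script in R that generates a random set of people with demographic information. I want it to generate the data by row rather than by column, so that each function can depend on the result of the previous function in the same row. I know this can be done with a for loop (as I did below), but for loops are very slow in R. I've read that you can use apply to do a loop more efficiently, but I haven't figured out how, despite many failed attempts. A working code sample with the loop is below. How would I replace the for loop with apply, working row by row, to fill the data.frame?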

y <- 1980 ## MedianYr 
d <- 0.1 ## Rate of NA responses 

AgeFn <- function(y){ 
    Year <- 1900 + as.POSIXlt(Sys.Date())$year 
    RNormYr <- as.integer((rnorm(1)*10+y)) 
    Age <- Year - RNormYr 
} 

EduByAge <- function (Age, d) { 
    ifelse(Age < 17, sample(c("Some High School",NA), size=1,prob=c((1-d),d)), 
    ifelse(Age > 16 & Age < 19, sample(c("Some High School", "High School Grad",NA), size=1, prob=c(0.085, 0.604,d)), 
     ifelse(Age > 18 & Age < 21, sample(c("Some High School", "High School Grad", "Associates",NA), size=1,prob=c(0.085, 0.25, 0.354,d)), 
     ifelse(Age > 20 & Age < 23, sample(c("Some High School", "High School Grad", "Associates", "Bachelors",NA), size=1,prob=c(0.085, 0.25, 0.075, 0.279,d)), 
      ifelse(Age > 22, sample(c("Some High School", "High School Grad", "Associates", "Bachelors", "Masters", "Professional", "Doctorate",NA),size=1,prob=c(0.085, 0.25, 0.075, 0.176, 0.072, 0.019, 0.012,d)), NA))))) 
} 

GenderFn <- function(d){ 
    Gender1 <- sample(c("Male","Female","Trans", NA), 1, replace=TRUE, prob=c(0.49, 0.5, 0.01, d)) 
    return(Gender1) 
} 

UserGen <- function(n,s) { 
    set.seed(s) 
    Rows <- function(y,d){ 
    Age <- abs(AgeFn(y)) 
    Gender <- GenderFn(d) 
    Education <- EduByAge(Age,d) 
    c(i, Age, Gender, Education) 
    } 
    df <- data.frame(matrix(NA, ncol = 4, nrow = n)) 
    for(i in (1:n)) { 
    df[i,] <- Rows(y,d) 
    } 
    colnames(df) <- c("ID", "Age", "Gender", "Education") 
    return(df) 
} 

It doesn't look like your functions return anything. For example, 'AgeFn' doesn't seem to return a value. – TARehman 2013-03-06 21:11:39

@Tarehman From '?"function"': "If the end of a function is reached without calling 'return', the value of the last evaluated expression is returned." – 2013-03-06 21:20:00

@BlueMagister Duh, I always forget that about R. – TARehman 2013-03-06 21:20:46
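To illustrate the point made in the comments above, here is a minimal sketch; the function `f` and the fixed year 2013 are placeholders, not part of the question's code. When the last expression in a function is an assignment, as in 'AgeFn', the value is still returned, just invisibly.

```r
# Hypothetical stand-in for AgeFn: the last expression is an assignment
f <- function(y) {
  Age <- 2013 - y
}

f(1980)        # prints nothing at top level: the value is returned invisibly
x <- f(1980)   # the value can still be captured
print(x)       # [1] 33
```

This is likely why 'AgeFn' appears to return nothing when called on its own at the console.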

Answers

1

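So, the way you've written your code means that you will end up with at least one loop.

apply is for applying a function to each element of another structure. So it will work when you want to pass a vector containing all the ages to other functions. However, it's not so hot for running your AgeFn() function, because that function doesn't take any argument that you would want to iterate over.

Here is another possibility, which ditches your method in favor of the sample function to get random ages. I made some assumptions, but I hope the explanations help you figure out how things work in R.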
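As a rough illustration of "applying a function to each element", here is a small sketch; `age_group` and its cut-offs are made up for the example and are not from the question.

```r
# Hypothetical helper: maps one age to a coarse label
age_group <- function(age) {
  if (age < 18) "minor" else if (age < 65) "adult" else "senior"
}

ages <- c(15, 30, 70)

# sapply() calls age_group() once per element and collects the results
# into a character vector: c("minor", "adult", "senior")
groups <- sapply(ages, age_group)
print(groups)
```

The pattern only helps when there is a vector to iterate over, which is why it suits EduByAge (one call per age) better than AgeFn.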

y <- 1980  ## MedianYr 
d <- 0.1  ## Rate of NA responses 
agemin <- 14 
agemax <- 90 

# The stats guy in me thinks that you might have some 
# methodological problems here with how the ages are assigned 
# But I'm just going to stick with it for now 
EduByAge <- function (Age, d) { 
    ifelse(Age < 17, sample(c("Some High School",NA), size=1,prob=c((1-d),d)), 
      ifelse(Age > 16 & Age < 19, sample(c("Some High School", "High School Grad",NA), size=1, prob=c(0.085, 0.604,d)), 
        ifelse(Age > 18 & Age < 21, sample(c("Some High School", "High School Grad", "Associates",NA), size=1,prob=c(0.085, 0.25, 0.354,d)), 
         ifelse(Age > 20 & Age < 23, sample(c("Some High School", "High School Grad", "Associates", "Bachelors",NA), size=1,prob=c(0.085, 0.25, 0.075, 0.279,d)), 
           ifelse(Age > 22, sample(c("Some High School", "High School Grad", "Associates", "Bachelors", "Masters", "Professional", "Doctorate",NA),size=1,prob=c(0.085, 0.25, 0.075, 0.176, 0.072, 0.019, 0.012,d)), NA))))) 
} 

NewUserGen <- function(n,s) { 

    set.seed(s) 

    ## Start by creating a data frame with IDs 
    fakedata <- data.frame(ID=1:n) 

    # Rather than a function, here I just used the built-in sample function 
    # I am sampling for n ages lying between agemin and agemax 
    # Using dnorm(), I assume a normal distribution of the ages, with 
    # mean age equal to today's year minus the "MedianYr" you were using above 
    # I assume that the mean and the SD are equal, you don't have to do that 

    # I put in a few extra carriage returns here to make things not quite so 
    # tight together - figured it would be easier to read. 
    fakedata$Age <- sample(x=agemin:agemax,size=n,replace=TRUE, 
          prob= 
          dnorm(agemin:agemax, 
          mean=abs(y-as.numeric(format.Date(Sys.Date(),"%Y"))), 
          sd=abs(y-as.numeric(format.Date(Sys.Date(),"%Y"))))) 

    # I'm sure you know this, but you have some issues here 
    # namely that you have a probability vector that totals to more than 1. 
    # You might be getting no NAs as a result. 
    fakedata$Gender <- sample(c("Male","Female","Trans", NA), 
           n, replace=TRUE, prob=c(0.49, 0.5, 0.01, d)) 

    # Here is the actual sapply() call
    fakedata$Edu <- sapply(fakedata$Age,FUN=EduByAge,d=0.1) 

    return(fakedata) 
} 

outdata <- NewUserGen(300,10201) 
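On the probability vector flagged in the code comments above: as far as the documentation for sample() goes, `prob` is treated as a vector of weights that need not sum to one, so NA should still be drawn. A minimal check, using the same weights as the Gender line (the seed and sample size are arbitrary):

```r
w <- c(0.49, 0.5, 0.01, 0.1)   # sums to 1.1, as in the Gender call
p <- w / sum(w)                 # the proportions the weights imply
print(round(p, 4))

set.seed(42)
g <- sample(c("Male", "Female", "Trans", NA), 10000,
            replace = TRUE, prob = w)

# Share of NA draws; should be close to 0.1 / 1.1
print(mean(is.na(g)))
```

So the effective NA rate is a little below `d`, not zero. Rescaling the other weights so that everything sums to one would make `d` the actual NA rate.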

Here is what the data looks like after all of this has run:

# "Professional" is included so that those responses are not turned into NA
outdata$Edu <- factor(outdata$Edu,levels=c("Some High School", 
              "High School Grad", 
              "Associates", 
              "Bachelors", 
              "Masters", 
              "Professional", 
              "Doctorate"),ordered=TRUE) 

hist(outdata$Age) 
barplot(table((outdata$Gender))) 
par(mai=c(3,1,1,1)) 
barplot(table((outdata$Edu)),las=2) 

[Figures: Edu Distribution, Gender Distribution, Age Histogram]

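So this is the fastest I've gotten it (5.739 seconds for 1 million fake users). I used a normal distribution for age because this is for a social media site, so my assumption is that the population will skew toward young adults rather than being evenly distributed across the whole population. This is my first time working with R, so your answer and the others were very helpful. Thanks! – MaxF 2013-03-07 01:53:20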

You could tighten the settings I used for 'dnorm' to get a narrower normal curve centered on a somewhat lower age. – TARehman 2013-03-07 18:58:27
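A sketch of what that suggestion might look like; the mean of 25 and SD of 8 are assumed values chosen only for illustration, not numbers from the thread.

```r
agemin <- 14
agemax <- 90
ages <- agemin:agemax

# Assumed settings: centre the curve at 25 with a fairly small spread
w <- dnorm(ages, mean = 25, sd = 8)

set.seed(1)
sim <- sample(ages, size = 1000, replace = TRUE, prob = w)
print(summary(sim))
```

Because the weights are only evaluated on agemin:agemax, the result is a truncated normal: no age falls outside that range, however wide the SD.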

0

I would modify the Rows function to take the ID as an argument, rather than picking up "i" from the enclosing scope.

Rows <- function(i, y,d){ 
    Age <- abs(AgeFn(y)) 
    Gender <- GenderFn(d) 
    Education <- EduByAge(Age,d) 
    c(i, Age, Gender, Education) 
} 

Then you can call the function with lapply:

res1 = lapply(1:3000, function(i){ 
    Rows(i, y, d) 
}) 

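This alone doesn't really improve the speed, but if you are on a machine with multiple cores, you might get some use out of the mclapply function. It originally came from the "multicore" library, which has since been archived on CRAN; mclapply now ships in the "parallel" package that comes with R. Note that mclapply relies on forking, so it does not run in parallel on Windows.

library("parallel")  # bundled with R; provides mclapply
res2 = mclapply(1:3000, function(i){ 
    Rows(i, y,d) 
}) 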

Oh, and if you want to use the results as a data frame, you can do this:

df = data.frame(do.call(rbind, res1)) 
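One caveat worth sketching: because Rows builds each row with c(), the numeric ID and Age are coerced to character alongside Gender and Education, so every column of the resulting data frame ends up as text (or factors in older versions of R). Below is one possible way around it, with a simplified stand-in `make_row` that is hypothetical and not the original Rows function.

```r
# Hypothetical, simplified row generator. Returning a list instead of c()
# lets each field keep its own type.
make_row <- function(i) {
  list(ID = i,
       Age = sample(18:60, 1),
       Gender = sample(c("Male", "Female"), 1))
}

set.seed(1)
rows <- lapply(1:5, make_row)

# Turn each list into a one-row data frame, then stack the rows
df <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
str(df)
```

Stacking many one-row data frames is itself slow, so for large n it is probably better to generate whole columns at once.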
0

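For the main function, you can use something from the apply family of functions, namely replicate. The speed improvement comes from the fact that R is a copy-on-assignment language, so the for loop copies the data frame unnecessarily: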

UserGen2 <- function(n,s) { 
    set.seed(s) 
    Rows <- function(y,d) { 
    Age <- abs(AgeFn(y)) 
    Gender <- GenderFn(d) 
    Education <- EduByAge(Age,d) 
    c(Age, Gender, Education) 
    } 
    samp <- t(replicate(n,Rows(y,d))) 
    colnames(samp) <- c("Age","Gender","Education") 
    data.frame(ID=seq_len(dim(samp)[1]),samp) 
} 

There are probably other improvements you could make as well.
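If a loop is still wanted, one such improvement might be to avoid assigning into the data frame inside it. The sketch below is an assumption-laden illustration (the columns and value ranges are placeholders, not the question's functions): it fills preallocated vectors and builds the data frame once at the end.

```r
n <- 1000
set.seed(1)

# Preallocate one vector per column instead of assigning into df[i, ]
age    <- integer(n)
gender <- character(n)

for (i in seq_len(n)) {
  age[i]    <- sample(18:60, 1)
  gender[i] <- sample(c("Male", "Female"), 1)
}

# Build the data frame a single time, after the loop
df <- data.frame(ID = seq_len(n), Age = age, Gender = gender,
                 stringsAsFactors = FALSE)
print(head(df))
```

Assigning into plain vectors is much cheaper than `df[i,] <- ...`, which can trigger a copy of the data frame on each iteration.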

That's a good start. With 10,000 rows you can see a 10x difference in time (see below). Thanks! '> system.time((UserGen(10000,5))) user system elapsed 11.076 4.577 16.483 > system.time((UserGen2(10000,5))) user system elapsed 1.552 0.011 1.603' – MaxF 2013-03-06 22:36:09