2016-08-17 101 views

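Reshape a data frame in R based on the number of unique values

I am working with job application data in which each previous job is a row in an Excel file. I want to reshape the dataset so that each past employer (1, 2, 3, 4, and so on) gets its own column.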

I think the problem is best explained with an example. How do I get from the start data frame to the desired data frame?

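I tried some melting and casting, but I got stuck: I don't want one column per unique company name, but rather as many employer columns as the largest number of employers listed for any single id.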

id <- c(1000,1000,1002,1007,1007,1007,1007,1009) 
employers <-c("Ikea","Subway","DISH","DISH","Ikea","Starbucks","Google","Google") 
start_date <- c("2/1/2013","5/1/2000","4/1/2012","3/1/2014","8/15/2011","4/15/2008","2/1/2004","3/15/2010") 
start <- data.frame(cbind(id,employers,start_date)) 
colnames(start) <- c("id","employers","start_date") 

start 

unique_id <- c(1000,1002,1007,1009) 
emp1 <- c("Ikea","DISH","DISH","Google") 
emp2 <- c("Subway",NA,"Ikea",NA) 
emp3 <- c(NA,NA,"Starbucks",NA) 
emp4 <- c(NA, NA,"Google",NA) 
emp1_start <- c("2/1/2013","4/1/2012","3/1/2014","3/15/2010") 
emp2_start <- c("5/1/2000",NA,"8/15/2011",NA) 
emp3_start <- c(NA,NA,"4/15/2008",NA) 
emp4_start <- c(NA,NA,"2/1/2004",NA) 
desired <- data.frame(cbind(unique_id,emp1,emp2,emp3,emp4,emp1_start,emp2_start,emp3_start,emp4_start)) 

desired 

`start$time <- with(start, ave(as.character(id), id, FUN = seq_along)); reshape(start, direction = "wide", idvar = "id", sep = "")`, adapted from another answer. – thelatemail


You forgot to rename the columns :-) (just kidding... your programming skills easily beat mine). – r2evans


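Thanks @thelatemail for spotting the duplicate and for posting an answer that uses my example. Creating the timevar as suggested works well on my actual data, which is larger and more complex. – andrea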
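The one-line suggestion in the comments above can be expanded into a small self-contained script. This is only a sketch under a few assumptions: the columns are read as plain character vectors, the order of rows within each id defines the job number, and the final renaming step (with the `emp1`, `emp1_start` naming taken from the desired output) is an addition that was not part of the original comment.

```r
# Rebuild the example data from the question as plain character columns.
start <- data.frame(
  id = c("1000", "1000", "1002", "1007", "1007", "1007", "1007", "1009"),
  employers = c("Ikea", "Subway", "DISH", "DISH",
                "Ikea", "Starbucks", "Google", "Google"),
  start_date = c("2/1/2013", "5/1/2000", "4/1/2012", "3/1/2014",
                 "8/15/2011", "4/15/2008", "2/1/2004", "3/15/2010"),
  stringsAsFactors = FALSE
)

# Number the jobs within each id (1, 2, ...) in row order.
start$time <- ave(seq_along(start$id), start$id, FUN = seq_along)

# One row per id; each remaining column is repeated once per job number,
# giving employers1, start_date1, employers2, start_date2, ...
wide <- reshape(start, direction = "wide", idvar = "id",
                timevar = "time", sep = "")

# Rename to the pattern used in the desired output (emp1, emp1_start, ...).
names(wide) <- sub("^employers([0-9]+)$", "emp\\1", names(wide))
names(wide) <- sub("^start_date([0-9]+)$", "emp\\1_start", names(wide))

wide
```

Because `reshape()` widens every column that is not the id or the time variable, extra per-job columns in the real data should be carried along without being listed by hand.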

Answer


Using your data (deliberately left as factors, which was the data.frame() default before R 4.0.0; this is easily fixed with stringsAsFactors = FALSE):

start <- data.frame(
    id         = c("1000", "1000", "1002", "1007",
                   "1007", "1007", "1007", "1009"),
    employers  = c("Ikea", "Subway", "DISH", "DISH",
                   "Ikea", "Starbucks", "Google", "Google"),
    start_date = c("2/1/2013", "5/1/2000", "4/1/2012", "3/1/2014",
                   "8/15/2011", "4/15/2008", "2/1/2004", "3/15/2010")
)

Would this work?

library(dplyr) 
library(tidyr) 

a <- start %>% 
    select(-start_date) %>% 
    group_by(id) %>% 
    mutate(emp = sprintf("emp%s", seq_len(n()))) %>% 
    ungroup() %>% 
    spread(emp, employers) 

b <- start %>% 
    select(-employers) %>% 
    group_by(id) %>% 
    mutate(emp = sprintf("emp%s_start", seq_len(n()))) %>% 
    ungroup() %>% 
    spread(emp, start_date) 

left_join(a, b, by = "id") 
# # A tibble: 4 x 9
#       id   emp1   emp2      emp3   emp4 emp1_start emp2_start emp3_start emp4_start
#   <fctr> <fctr> <fctr>    <fctr> <fctr>     <fctr>     <fctr>     <fctr>     <fctr>
# 1   1000   Ikea Subway        NA     NA   2/1/2013   5/1/2000         NA         NA
# 2   1002   DISH     NA        NA     NA   4/1/2012         NA         NA         NA
# 3   1007   DISH   Ikea Starbucks Google   3/1/2014  8/15/2011  4/15/2008   2/1/2004
# 4   1009 Google     NA        NA     NA  3/15/2010         NA         NA         NA
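With tidyr 1.0.0 or later, where `spread()` is superseded by `pivot_wider()`, the two-pass approach above can probably be collapsed into a single pipeline. The following is a hedged sketch rather than part of the original answer: the helper column `job` and the result name `wide` are illustrative, the data are assumed to be character columns, and `rename_with()` needs dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Same example data, read as plain character columns.
start <- data.frame(
  id = c("1000", "1000", "1002", "1007", "1007", "1007", "1007", "1009"),
  employers = c("Ikea", "Subway", "DISH", "DISH",
                "Ikea", "Starbucks", "Google", "Google"),
  start_date = c("2/1/2013", "5/1/2000", "4/1/2012", "3/1/2014",
                 "8/15/2011", "4/15/2008", "2/1/2004", "3/15/2010"),
  stringsAsFactors = FALSE
)

wide <- start %>%
  group_by(id) %>%
  mutate(job = row_number()) %>%   # job number within each id, in row order
  ungroup() %>%
  # Widen both value columns at once: employers_1, ..., start_date_1, ...
  pivot_wider(names_from = job, values_from = c(employers, start_date)) %>%
  # Rename to the pattern used in the desired output.
  rename_with(~ sub("^employers_([0-9]+)$", "emp\\1", .x)) %>%
  rename_with(~ sub("^start_date_([0-9]+)$", "emp\\1_start", .x))

wide
```

Since `values_from` accepts several columns, additional per-job fields can be listed there instead of building one `select()` pipeline per column and joining the results.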

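Thanks @r2evans. I'll keep this for future use. It works very well on my simple example, but it is a bit cumbersome on my actual data, which also has multiple rows for past schools with their related dates, GPA and so on, so the select() part isn't straightforward. – andrea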