2017-05-05 81 views
0

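Add rows for missing years by group

I want to create new rows in a data.frame for every missing year within each group (firm and type). The data frame is built as follows: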

minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"), 
        type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"), 
        year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008), 
        value = c(1,3,7,9,9,2,3,3,7,5,9,15) 
       ) 

The data frame:

firm type year value 
A X 2000  1 
A X 2004  3 
A X 2007  7 
B X 2010  9 
B X 2008  9 
B X 2001  2 
A Y 2002  3 
A Y 2003  3 
A Y 2007  7 
B Y 2000  5 
B Y 2001  9 
B Y 2008 15 

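What I want is this: I can see in the data that the minimum year is 2000 and the maximum is 2010. I want to add a row for each missing year for every firm-type combination. For example, for firm A and type X, I want to add rows so that it looks like this:

Final output: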

firm type year value 
A X 2000  1 
A X 2004  3 
A X 2007  7 
A X 2001  1 
A X 2002  1 
A X 2003  1 
A X 2005  3 
A X 2006  3 
A X 2008  7 
A X 2009  7 
A X 2010  7 

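In addition, I want to carry the previous year's value forward in the "value" column until a new non-missing row appears (as shown in the final output example).

I have not come up with any useful code yet, but what I have found so far, and which may be a step in the right direction, is the following: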

setDT(minimal)[, .SD[match(2000:2010, year)], 
          by = c("firm","type")] 

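I do not really understand the concepts of setDT and .SD, but this at least creates rows for each firm-type combination. However, the year is not filled in.

Thanks a lot!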

+0

I think this would be useful: check 'complete' from 'tidyr', 'expand.grid' from 'base R', or 'CJ' from 'data.table'. – akrun

+0

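OK, I came up with 'min2 <- expand.grid(year = min(minimal$year):max(minimal$year), firm = unique(minimal$firm), type = unique(minimal$type))' and 'merge(min2, minimal, by = c("firm","type","year"), all.x = T)'. Now I only need to add the correct value to each row, but I do not know yet how to do that. – Rnewbie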
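
The step still missing from the comment above (filling in the value for each added row) can be sketched in base R. This is only one possible way to finish that approach, not the commenter's own code: 'grid', 'filled' and the helper 'carry_forward' are names invented here for illustration.

```r
# Sketch: complete the firm/type/year grid in base R, then carry values forward.
minimal <- data.frame(
  firm  = c("A","A","A","B","B","B","A","A","A","B","B","B"),
  type  = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
  year  = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
  value = c(1,3,7,9,9,2,3,3,7,5,9,15)
)

# Every firm/type/year combination between the smallest and largest year
grid <- expand.grid(
  year = min(minimal$year):max(minimal$year),
  firm = unique(minimal$firm),
  type = unique(minimal$type),
  stringsAsFactors = FALSE
)

# Left join: years absent from `minimal` get value = NA
filled <- merge(grid, minimal, by = c("firm", "type", "year"), all.x = TRUE)
filled <- filled[order(filled$firm, filled$type, filled$year), ]

# carry_forward() is a hypothetical helper: last observation carried forward
carry_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the latest non-NA seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # idx == 0 (no value seen yet) maps to NA
}

# Apply the helper separately within each firm/type group
filled$value <- ave(filled$value, filled$firm, filled$type, FUN = carry_forward)
```

Years that come before a group's first observation stay NA here, because there is no earlier value to carry forward.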

+0

Try this: 'library(dplyr); library(tidyr); minimal %>% group_by(firm, type) %>% complete(year = full_seq(year, 1)) %>% fill(value)' – Sotos

Answers

0

I wrote this code, which does what you want. It may not be the most efficient or elegant, but it works:

# Input dataframe 
minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"), 
         type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"), 
         year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008), 
         value = c(1,3,7,9,9,2,3,3,7,5,9,15) 
) 

# Sorting is needed 
minimal = minimal[order(minimal$firm, minimal$type, minimal$year),] 

# Variables used 
# Group sizes indexed as [firm, type], in the same order as the sorted rows (A-X, A-Y, B-X, B-Y)
table = table(minimal$firm, minimal$type) 
minYear = min(minimal$year) 
maxYear = max(minimal$year) 
startPos = 0 

# Iterates the dataframe 
for(i in 1:2){ 
    for(j in 1:2){ 
    prevValue = 0 
    currYear = minYear 

    # Adds minimum year if needed 
    if(minimal$year[1+startPos] != currYear){ 
     newRow = c(as.character(minimal$firm[1+startPos]), as.character(minimal$type[1+startPos]), currYear, prevValue) 
     minimal = rbind(minimal, newRow) 
    } 

    # Adds years 
    for(k in (1+startPos):(table[i,j]+startPos)){ 
     if(minimal$year[k]!=currYear){ 
     currYear = currYear + 1 
     while(minimal$year[k]!=currYear){ 
      newRow = c(as.character(minimal$firm[k]), as.character(minimal$type[k]), currYear, prevValue) 
      minimal = rbind(minimal, newRow) 
      currYear = currYear + 1 
     } 
     } 
     prevValue = minimal$value[k] 
    } 

    # Adds years from last to maximum 
    if(currYear < maxYear){ 
     for(l in 1:(maxYear - currYear)){ 
     newRow = c(as.character(minimal$firm[k]), as.character(minimal$type[k]), currYear+l, prevValue) 
     minimal = rbind(minimal, newRow) 
     } 
    } 
    startPos = startPos + table[i,j] 

    } 
} 

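# Restore numeric columns (rbind with a character vector coerces them to character) 
minimal$year = as.numeric(minimal$year) 
minimal$value = as.numeric(minimal$value) 

# Result 
minimal = minimal[order(minimal$firm, minimal$type, minimal$year),] 
minimal 

0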

I could not find an exact duplicate for this question, so here is a possible solution:

library(dplyr) 
library(tidyr) 

minimal %>% 
    group_by(firm, type) %>% 
    complete(year = full_seq(2000:2010, 1)) %>% 
    fill(value) 
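
With 'fill(value)' alone, a group whose first observed year is later than 2000 (A-Y, for instance) keeps NA in its leading rows. If those rows should take the first observed value instead, one option is the '.direction' argument of 'fill'. The sketch below assumes that behaviour is wanted, which the question does not state; 'result' is a name chosen here for illustration.

```r
library(dplyr)
library(tidyr)

minimal <- data.frame(
  firm  = c("A","A","A","B","B","B","A","A","A","B","B","B"),
  type  = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
  year  = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
  value = c(1,3,7,9,9,2,3,3,7,5,9,15)
)

result <- minimal %>%
  group_by(firm, type) %>%
  # Use the overall year range instead of hard-coding 2000:2010
  complete(year = full_seq(range(minimal$year), 1)) %>%
  # "downup": carry values forward first, then fill leading NAs backwards
  fill(value, .direction = "downup") %>%
  ungroup()
```

Filling downwards first means values in the middle of a group are still taken from the previous year; only the leading gap is filled from the first observation.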

0

Here is a data.table solution.

library(data.table) 

dt <- setDT(minimal)[CJ(firm=firm, type=type, year=seq(min(year), max(year)), unique=TRUE), 
       on=.(firm, type, year), roll=TRUE] 

This returns

head(dt, 15) 
    firm type year value 
1: A X 2000  1 
2: A X 2001  1 
3: A X 2002  1 
4: A X 2003  1 
5: A X 2004  3 
6: A X 2005  3 
7: A X 2006  3 
8: A X 2007  7 
9: A X 2008  7 
10: A X 2009  7 
11: A X 2010  7 
12: A Y 2000 NA 
13: A Y 2001 NA 
14: A Y 2002  3 
15: A Y 2003  3 

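Note that the initial rows of the second firm-type combination are NA. If you want to fill these in from the following year, you can set the roll argument to "nearest", although this may also change values in the middle of the data.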
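
One way to fill only the leading NA rows while leaving the middle of the data untouched is to keep 'roll=TRUE' and then fill backwards within each group using 'nafill' from data.table. This is a sketch of an alternative to the "nearest" suggestion, not part of the original answer; 'grid' is a name introduced here.

```r
library(data.table)

minimal <- data.table(
  firm  = c("A","A","A","B","B","B","A","A","A","B","B","B"),
  type  = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
  year  = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
  value = c(1,3,7,9,9,2,3,3,7,5,9,15)
)

# All firm/type/year combinations over the observed year range
grid <- CJ(firm = minimal$firm, type = minimal$type,
           year = seq(min(minimal$year), max(minimal$year)), unique = TRUE)

# Rolling join: each missing year takes the most recent earlier value
dt <- minimal[grid, on = .(firm, type, year), roll = TRUE]

# Remaining NAs sit before a group's first observation;
# fill them with the next observed value ("next observation carried backward")
dt[, value := nafill(value, type = "nocb"), by = .(firm, type)]
```

Because the rolling join has already filled every gap after a group's first observation, the backward fill only touches the leading rows.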