2014-12-02 162 views
2

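Split an R data frame into several rows

I am looking for a way to split each row of my data frame into multiple rows.

My test input data looks like this: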

data <- read.table(text = "group; yr1; yr2; val; col2
    a; 1927; 1934; -140; coltest
    a; 1953; 1955; -480; coltest
    b; 1957; 1958; -280; coltest1
    b; 1961; 1965; -1420; coltest1", sep = ";", header = TRUE,
    strip.white = TRUE, stringsAsFactors = FALSE)

What I am looking for is a way to compute the value per year and write each year out as its own row, like this:

group; yr1; yr2; val; col2 
    a; 1927; 1928; -20; coltest 
    a; 1928; 1929; -20; coltest 
    a; 1929; 1930; -20; coltest 
    a; 1930; 1931; -20; coltest 
    a; 1931; 1932; -20; coltest 
    a; 1932; 1933; -20; coltest 
    a; 1933; 1934; -20; coltest 
    a; 1953; 1954; -240; coltest 
    a; 1954; 1955; -240; coltest 
    b; 1957; 1958; -280; coltest1 
    b; 1961; 1962; -355; coltest1 
    b; 1962; 1963; -355; coltest1 
    b; 1963; 1964; -355; coltest1 
    b; 1964; 1965; -355; coltest1 

I can compute the per-year value like this, but I cannot work out how to split it into separate rows.

data$new <- data$val/(data$yr2-data$yr1) 

Answers

3

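Here is one possibility using expandRows from my "splitstackshape" package, along with some compound statements in "data.table":

library(data.table)       # as.data.table() and := come from data.table
library(splitstackshape)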
expandRows(
    as.data.table(
    data, keep.rownames = TRUE)[, diff := yr2 - yr1][, 
     val := val/diff], "diff")[, yr1 := yr1 + sequence(.N) - 1L, 
     by = list(group, rn)][, yr2 := yr1 + 1][] 
#  rn group yr1 yr2 val 
# 1: 1  a 1927 1928 -20 
# 2: 1  a 1928 1929 -20 
# 3: 1  a 1929 1930 -20 
# 4: 1  a 1930 1931 -20 
# 5: 1  a 1931 1932 -20 
# 6: 1  a 1932 1933 -20 
# 7: 1  a 1933 1934 -20 
# 8: 2  a 1953 1954 -240 
# 9: 2  a 1954 1955 -240 
# 10: 3  b 1957 1958 -280 
# 11: 4  b 1961 1962 -355 
# 12: 4  b 1962 1963 -355 
# 13: 4  b 1963 1964 -355 
# 14: 4  b 1964 1965 -355 
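
To make the chain above easier to follow, here is a minimal base R sketch of the same idea: repeat each row once per year in its span, then offset the start year with sequence(). This is not how expandRows is implemented internally. The names n, idx and out are illustrative only, and the input is rebuilt inline (without col2) so the snippet is self-contained.

```r
# Illustrative input: the question's four rows, without col2
data <- data.frame(group = c("a", "a", "b", "b"),
                   yr1   = c(1927L, 1953L, 1957L, 1961L),
                   yr2   = c(1934L, 1955L, 1958L, 1965L),
                   val   = c(-140, -480, -280, -1420),
                   stringsAsFactors = FALSE)

n   <- data$yr2 - data$yr1               # number of yearly rows per input row
idx <- rep(seq_len(nrow(data)), n)       # row i repeated n[i] times
out <- data[idx, ]

out$yr1 <- out$yr1 + sequence(n) - 1L    # 1927, 1928, ... within each input row
out$yr2 <- out$yr1 + 1L                  # each output row spans exactly one year
out$val <- out$val / n[idx]              # spread the total evenly over the years
rownames(out) <- NULL
out
```

The rep() call plays the role of expandRows, and sequence(n) restarts the counter for every original row, which is what sequence(.N) does per group in the data.table chain.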

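Compared with @beginneR's approach, the expandRows approach above is more efficient, but a pure "data.table" approach is faster still.

Here is a comparison on just 1000 rows:

The functions....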

library(dplyr)  # provides %>%, rowwise() and do()

beginneR <- function() { 
    data %>% 
    rowwise %>% 
    do(data.frame(group = .$group, 
        yr1 = .$yr1:(.$yr2-1), 
        yr2 = (.$yr1+1):.$yr2, 
        val = .$val/(.$yr2 - .$yr1), stringsAsFactors = FALSE)) 
} 

ananda <- function() { 
    expandRows(
    as.data.table(
     data, keep.rownames = TRUE)[, diff := yr2 - yr1][, 
     val := val/diff], "diff")[, yr1 := yr1 + sequence(.N) - 1L, 
      by = list(group, rn)][, yr2 := yr1 + 1][] 
} 

codoremifa <- function() { 
    as.data.table(data)[,SNO := .I][, 
    val := val/(yr2 - yr1)][, 
     list(yr = yr1:(yr2-1), val), by = list(group,SNO)][, 
     SNO := NULL][, yr2 := yr + 1][] 
} 

The timings....

data <- do.call(rbind, replicate(250, data, FALSE)) 
dim(data) 
# [1] 1000 4 
system.time(beginneR()) 
# |====================================|100% ~0 s remaining 
# user system elapsed 
# 2.408 0.000 2.297 
system.time(ananda()) 
# user system elapsed 
# 0.000 0.000 0.017 

library(microbenchmark) 
microbenchmark(ananda(), codoremifa()) 
# Unit: milliseconds 
#   expr  min  lq  mean median  uq  max neval 
#  ananda() 16.791794 17.048305 18.096050 17.786861 18.537067 22.34243 100 
# codoremifa() 8.018706 8.201175 8.649698 8.406204 8.649132 13.87685 100 
+0

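Thanks. But I get the error 'could not find function "expandRows"'. I am on Mac OS X 10.8.5 and using R 3.1.1. I installed the package with the command 'install.packages("splitstackshape")'. – nebuloso 2014-12-02 13:33:00

+0

I cannot get it to work. I am running splitstackshape version 1.2.0 and data.table 1.9.2, and I reinstalled both packages from source. – nebuloso 2014-12-02 13:47:48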

+0

Nice answer, especially with the comparison (as expected) (+1) – 2014-12-02 13:53:50

2

Possibly not the most efficient solution, but it produces the desired output:

library(dplyr) 

data %>% 
    rowwise %>% 
    do(data.frame(group = .$group, 
       yr1 = .$yr1:(.$yr2-1L), 
       yr2 = (.$yr1+1L):.$yr2, 
       val = .$val/(.$yr2 - .$yr1), stringsAsFactors = FALSE)) 

#Source: local data frame [14 x 4] 
#Groups: <by row> 
# 
# group yr1 yr2 val 
#1  a 1927 1928 -20 
#2  a 1928 1929 -20 
#3  a 1929 1930 -20 
#4  a 1930 1931 -20 
#5  a 1931 1932 -20 
#6  a 1932 1933 -20 
#7  a 1933 1934 -20 
#8  a 1953 1954 -240 
#9  a 1954 1955 -240 
#10  b 1957 1958 -280 
#11  b 1961 1962 -355 
#12  b 1962 1963 -355 
#13  b 1963 1964 -355 
#14  b 1964 1965 -355 
+0

@AnandaMahto, you are absolutely right. Thanks, edited. – 2014-12-02 13:48:18

5
library(data.table) 
setDT(data) 
data[,SNO := .I] 
data[,val := val/(yr2 - yr1)] 
(data[, 
    list(yr = yr1:(yr2-1), val), 
    by = list(group,SNO) 
    ][, 
     SNO := NULL 
     ][, 
     yr2 := yr + 1] 

) 

Output

#  group yr val yr2 
# 1:  a 1927 -20 1928 
# 2:  a 1928 -20 1929 
# 3:  a 1929 -20 1930 
# 4:  a 1930 -20 1931 
# 5:  a 1931 -20 1932 
# 6:  a 1932 -20 1933 
# 7:  a 1933 -20 1934 
# 8:  a 1953 -240 1954 
# 9:  a 1954 -240 1955 
# 10:  b 1957 -280 1958 
# 11:  b 1961 -355 1962 
# 12:  b 1962 -355 1963 
# 13:  b 1963 -355 1964 
# 14:  b 1964 -355 1965 
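
Note that setDT() and := modify data by reference, so after running the code above the original val column holds the per-year values rather than the totals. A hedged sketch that works on a copy instead and keeps the question's column names follows. The names dt and res are illustrative, and the input is rebuilt inline (without col2) so the snippet is self-contained.

```r
library(data.table)

# Illustrative input: the question's four rows, without col2
data <- data.frame(group = c("a", "a", "b", "b"),
                   yr1   = c(1927L, 1953L, 1957L, 1961L),
                   yr2   = c(1934L, 1955L, 1958L, 1965L),
                   val   = c(-140, -480, -280, -1420),
                   stringsAsFactors = FALSE)

dt <- as.data.table(data)          # a copy; the data.frame itself is left untouched
dt[, SNO := .I]                    # row id, so each input row forms its own group
dt[, val := val / (yr2 - yr1)]     # per-year value

res <- dt[, list(yr1 = yr1:(yr2 - 1L), val = val),
          by = list(group, SNO)][, SNO := NULL][, yr2 := yr1 + 1L][]
setcolorder(res, c("group", "yr1", "yr2", "val"))
res
```

Because the by-reference updates are applied to dt only, data still holds the original totals and the code can be rerun safely.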
+0

This is not quite right: it adds one row too many at the end of each group! – nebuloso 2014-12-02 13:33:34

+1

@nebuloso, that is an easy fix. Just changing 'yr1:yr2' to 'yr1:(yr2-1)' should do it. – A5C1D2H2I1M1N2O1R2T1 2014-12-02 13:55:47

+0

@AnandaMahto, it would be interesting to see this included in your benchmark – 2014-12-02 13:57:32
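
On the off-by-one fix discussed in the comments above, a small illustration of why the parentheses matter, using the first row's years as example values: a span from yr1 to yr2 contains yr2 - yr1 yearly intervals, but yr1:yr2 yields one more value than that.

```r
yr1 <- 1927L
yr2 <- 1934L

length(yr1:yr2)          # 8 values: one more than the 7 yearly intervals
length(yr1:(yr2 - 1L))   # 7 values: one start year per interval

# Without parentheses, ':' binds tighter than '-', so this is (yr1:yr2) - 1
range(yr1:yr2 - 1L)      # 1926 1933
```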