2017-07-06

Rearranging multiple rows of a large data frame in R

I am relatively new to R. I have a data frame, test, that looks like this:

PMID # id 
LID 
STAT 
MH 
RN 
OT 
PST  # cue 
LID 
STAT 
MH 
PMID # id 
OT 
PST  # cue 
LID 
DEP 
RN 
PMID # id 
PST  # cue 

and I would like it to look like this:

PMID # id 
LID 
STAT 
MH 
RN 
OT 
PST  # cue 
PMID # id 
LID 
STAT 
MH 
OT 
PST  # cue 
PMID # id 
LID 
DEP 
RN 
PST  # cue 

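Basically, I want the entries that follow a PMID to belong to that particular PMID, which is already the case for the first PMID. After the first one, however, each PMID sits at a random position among its own entries. Every record ends with PST, so I want each PMID after the first to be moved to the position right after the previous PST. I have two data frames that hold the index positions of every PMID and every PST. For example, for PMID, the data frame a_new contains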

1 
11 
17 

and for PST, the data frame b contains

7 
13 
18 

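This is what I have tried so far, but because I have more than 24 million rows, it had not finished after hours of running, and when I stopped it my data frame had not changed: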

for (i in 1:nrow(test)) 
{  
    if (i %in% a_new$X1) # if it's a PMID 
    { 
    entry <- match(i, a_new$X1) # find entry index of PMID 
    if (entry != 1) # as long as not first row from a_new (that's corrected) 
    { 
     r <- b[i, 1] # row of PST 
     test <- rbind(test[1:r, ], test[entry, 1], test[-(1:r), ]) 
     test <- test[-c(i+1), ] # remove duplicate PMID 
    } 
    } 
} 

As you can see, rbind is extremely inefficient in this case. Please advise.

+0

'test' does not look like a 'data.frame': it has no column names or row numbers – HubertL

+0

It is 24 million observations/rows and 1 column – sweetmusicality

+0

I don't know how to show the column names and row numbers on Stack Overflow (without typing them in manually) – sweetmusicality

Answers

2

Here is an answer using data.table.

library(data.table) 

dat <- fread("Origcol 
      PMID 
      LID 
      STAT 
      MH 
      RN 
      OT 
      PST  
      LID 
      STAT 
      MH 
      PMID  
      OT 
      PST  
      LID 
      DEP 
      RN 
      PMID 
      PST") 

dat[, old_order := 1:.N] 
pst_index <- c(0, which(dat$Origcol == "PST")) 
dat[, grp := unlist(lapply(1:(length(pst_index)-1), 
          function(x) rep(x, 
              times = (pst_index[x+1] - pst_index[x]))))] 
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", 
              "MH", "RN", "OT", 
              "DEP", "PST"))] 
dat[order(grp, Origcol)] 

Result:

Origcol old_order grp 
1: PMID   1 1 
2:  LID   2 1 
3: STAT   3 1 
4:  MH   4 1 
5:  RN   5 1 
6:  OT   6 1 
7:  PST   7 1 
8: PMID  11 2 
9:  LID   8 2 
10: STAT   9 2 
11:  MH  10 2 
12:  OT  12 2 
13:  PST  13 2 
14: PMID  17 3 
15:  LID  14 3 
16:  RN  16 3 
17:  DEP  15 3 
18:  PST  18 3 
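For what it is worth, the group id can also be computed in one vectorized step, without lapply and rep. The following is only a sketch in base R on the toy data, treating the column as a plain character vector x (an assumption made for illustration). Unlike the factor-based sort, it only moves each PMID to the front of its record and leaves the other rows in their original order.

```r
# Toy data from the question, held as a plain character vector (assumption)
x <- c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
       "LID", "STAT", "MH", "PMID", "OT", "PST",
       "LID", "DEP", "RN", "PMID", "PST")

is_pst <- x == "PST"

# Record id = number of PST rows strictly before the current row, plus one.
# A PST row therefore stays in the record that it closes.
grp <- cumsum(is_pst) - is_pst + 1L

# Sort by record, then put PMID first (FALSE sorts before TRUE).
# Remaining ties keep their original order, so the other rows do not move.
new_order <- order(grp, x != "PMID")
result <- x[new_order]

print(data.frame(result, old_order = new_order, grp = grp[new_order]))
```

On a data.table the same expression should fit inside a := call, for example dat[, grp := cumsum(Origcol == "PST") - (Origcol == "PST") + 1L], though I have not timed that variant on the 14 million row data.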

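The advantage of this approach is that data.table performs many operations by reference, so it is fast once you scale up. You say you have 14 million rows, so let's try it. Generate some synthetic data of that size: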

dat_big <- data.table(Origcol = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST")) 
dat_big_add <- rbindlist(lapply(1:10000, 
           function(x) data.table(Origcol = c(sample(c("PMID", "LID", "STAT", 
                      "MH", "RN", "OT")), 
                    "PST")))) 
dat_big <- rbindlist(list(dat_big, 
          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add, 
          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add, 
          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add, 
          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add)) 

dat <- rbindlist(list(dat_big, dat_big, dat_big, dat_big, dat_big, 
         dat_big, dat_big, dat_big, dat_big, dat_big)) 

We now have:

  Origcol 
     1: PMID 
     2:  LID 
     3: STAT 
     4:  MH 
     5:  RN 
     ---   
14000066: STAT 
14000067:  MH 
14000068:  OT 
14000069: PMID 
14000070:  PST 

Applying the same code as above:

dat[, old_order := 1:.N] 
pst_index <- c(0, which(dat$Origcol == "PST")) 
dat[, grp := unlist(lapply(1:(length(pst_index)-1), 
          function(x) rep(x, 
              times = (pst_index[x+1] - pst_index[x]))))] 
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", 
              "MH", "RN", "OT", 
              "DEP", "PST"))] 
dat[order(grp, Origcol)] 

Now we get:

  Origcol old_order  grp 
     1: PMID   1  1 
     2:  LID   2  1 
     3: STAT   3  1 
     4:  MH   4  1 
     5:  RN   5  1 
     ---       
14000066: STAT 14000066 2000010 
14000067:  MH 14000067 2000010 
14000068:  RN 14000064 2000010 
14000069:  OT 14000068 2000010 
14000070:  PST 14000070 2000010 

How long does it take?

library(microbenchmark) 
microbenchmark(
    "data.table" = { 
    dat[, old_order := 1:.N] 
    pst_index <- c(0, which(dat$Origcol == "PST")) 
    dat[, grp := unlist(lapply(1:(length(pst_index)-1), 
           function(x) rep(x, 
               times = (pst_index[x+1] - pst_index[x]))))] 
    dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", 
               "MH", "RN", "OT", 
               "DEP", "PST"))] 
    dat[order(grp, Origcol)] 
    }, 
    times = 10) 

And it takes:

Unit: seconds 
     expr  min  lq  mean median  uq  max neval 
data.table 5.755276 5.813267 6.059665 5.87151 6.034506 7.310169 10 

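Under 10 seconds for 14 million rows (about 6 seconds on average in the benchmark above). Generating the test data took longer.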

+0

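Wow, thank you for such a thorough answer! Your explanation looks promising. However, when running the 'test[, grp := unlist....' line, I ran into this error: 'Error in rep(x, times = (pst_index[x + 1] - pst_index[x])) : invalid 'times' argument' – sweetmusicality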

+0

Is there anything different about the 'test' data set? Does my code fail on your machine when copied as is? –

+0

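Oh, your code works perfectly when copied as is. My actual data set has some rows that start the same way (you will see what I mean in the next sentence) and that follow one another consecutively. Also, each row is not just a single word, for example "PMID - 234254" or "MH - Humans", but I don't see why that would cause the error. After seeing your code, I converted my data frame to a data.table with 'setDT(df)'... is that the appropriate approach? – sweetmusicality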
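One plausible cause of that error, sketched below under assumptions: if every row looks like "PMID - 234254", the exact comparison Origcol == "PST" matches nothing, so pst_index is just c(0). Then 1:(length(pst_index)-1) is 1:0, which counts down as c(1, 0), pst_index[2] is NA, and rep() is called with times = NA. Comparing only the leading tag avoids this. The lines below are made-up values in the shape described in the comment, and the regular expression assumes the tag is the leading run of letters.

```r
# Hypothetical lines in the "TAG - value" shape (values are made up)
lines <- c("PMID - 234254", "MH - Humans", "PST - ppublish",
           "MH - Humans", "MH - Animals", "PMID - 234255", "PST - ppublish")

# Exact comparison against the whole line finds no PST rows at all
n_exact <- length(which(lines == "PST"))

# Extract the tag: the leading run of letters (assumption about the format)
tag <- sub("^([A-Za-z]+).*$", "\\1", lines)

# Comparing the tag finds the record ends
pst_index <- which(tag == "PST")

# Vectorized record id; it relies on the data ending with a PST row
is_pst <- tag == "PST"
stopifnot(is_pst[length(is_pst)])
grp <- cumsum(is_pst) - is_pst + 1L

print(data.frame(lines, tag, grp))
```

Consecutive rows with the same tag (the two MH rows here) are not a problem for the grouping itself, since only the PST rows define where a record ends.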

1

Here is an indexing approach using which.

# get positions of PST, the final value of each record
# (compare the V1 column only, not the whole data frame)
endSpot <- which(temp$V1 == "PST")
# increment to get the desired positions of the PMID
# (dropping final value as we don't need to change it)
startSpot <- head(endSpot + 1, -1)
# get the current positions of the PMID, except the first one
PMIDSpot <- tail(which(temp$V1 == "PMID"), -1)

Now use these indices to swap the rows:

temp[c(startSpot, PMIDSpot), ] <- temp[c(PMIDSpot, startSpot), ] 

This returns the following (I added a variable called count to keep track of the original row positions).

temp 
    V1 count 
1 PMID  1 
2 LID  2 
3 STAT  3 
4 MH  4 
5 RN  5 
6 OT  6 
7 PST  7 
8 PMID 11 
9 STAT  9 
10 MH 10 
11 LID  8 
12 OT 12 
13 PST 13 
14 PMID 17 
15 DEP 15 
16 RN 16 
17 LID 14 
18 PST 18 
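Since the swap assumes exactly one PMID per PST-terminated record, it may be worth checking the result before trusting it on the full data. Below is a minimal sketch of such a check; check_records is a hypothetical helper name, and the input is assumed to be a plain character vector such as temp$V1.

```r
# Hypothetical helper: TRUE if the data end with PST and every
# PST-terminated record starts with PMID and holds exactly one PMID
check_records <- function(x) {
  is_pst <- x == "PST"
  if (length(x) == 0L || !is_pst[length(x)]) return(FALSE)
  grp <- cumsum(is_pst) - is_pst + 1L            # record id per row
  first_in_grp <- !duplicated(grp)               # first row of each record
  pmid_per_grp <- tapply(x == "PMID", grp, sum)  # PMID count per record
  all(x[first_in_grp] == "PMID") && all(pmid_per_grp == 1)
}

# V1 after the swap, as printed above
swapped <- c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
             "PMID", "STAT", "MH", "LID", "OT", "PST",
             "PMID", "DEP", "RN", "LID", "PST")

# V1 before the swap
original <- c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
              "LID", "STAT", "MH", "PMID", "OT", "PST",
              "LID", "DEP", "RN", "PMID", "PST")

ok_swapped  <- check_records(swapped)
ok_original <- check_records(original)
print(c(swapped = ok_swapped, original = ok_original))
```

Note that the swap exchanges two rows, so the row that previously followed the PST (LID here) ends up where the PMID used to be; the check only verifies the position and count of PMID, not the order of the other rows.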

Data

temp <- 
structure(list(V1 = c("PMID", "LID", "STAT", "MH", "RN", "OT", 
"PST", "LID", "STAT", "MH", "PMID", "OT", "PST", "LID", "DEP", 
"RN", "PMID", "PST"), count = 1:18), .Names = c("V1", "count" 
), row.names = c(NA, -18L), class = "data.frame")