Rearranging multiple rows of a data frame in R for big data

I am relatively new to R. I have a data frame test that looks like this:

PMID # id 
LID 
STAT 
MH 
RN 
OT 
PST  # cue 
LID 
STAT 
MH 
PMID # id 
OT 
PST  # cue 
LID 
DEP 
RN 
PMID # id 
PST  # cue

and I want it to look like this:

PMID # id 
LID 
STAT 
MH 
RN 
OT 
PST  # cue 
PMID # id 
LID 
STAT 
MH 
OT 
PST  # cue 
PMID # id 
LID 
DEP 
RN 
PST  # cue

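Basically, I want the entries that follow a PMID to belong to that specific PMID, which is already the case for the first PMID. After the first one, however, each PMID sits at a random position among its own entries. Every record ends with PST, so I want each PMID after the first to move to the position immediately after the previous PST. I have two data frames that hold the index positions of every PMID and every PST. For example, for PMID, the data frame a_new contains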

1 
11 
17

and for PST, the data frame b contains

7 
13 
18

This is what I have tried, but because I have over 24 million rows it never finished running, and when I stopped it my data frame had not changed:

for (i in 1:nrow(test)) {
    if (i %in% a_new$X1) {                # if it's a PMID
        entry <- match(i, a_new$X1)       # find entry index of PMID
        if (entry != 1) {                 # as long as not first row from a_new (that's corrected)
            r <- b[i, 1]                  # row of PST
            test <- rbind(test[1:r, ], test[entry, 1], test[-(1:r), ])
            test <- test[-c(i + 1), ]     # remove duplicate PMID
        }
    }
}

As you can see, rbind is extremely inefficient in this case. Please advise.

Source

2017-07-06 sweetmusicality

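'test' doesn't look like a 'data.frame': it has no column names or row numbers – HubertL

It has 24 million observations/rows and 1 column – sweetmusicality

I don't know how to add the column names and row numbers on Stack Overflow (without doing it manually) – sweetmusicality

Here is an answer using data.table.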

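library(data.table)

dat <- fread("Origcol
              PMID
              LID
              STAT
              MH
              RN
              OT
              PST
              LID
              STAT
              MH
              PMID
              OT
              PST
              LID
              DEP
              RN
              PMID
              PST")

# remember the original row position
dat[, old_order := 1:.N]
# record boundaries: every record ends at a PST row
pst_index <- c(0, which(dat$Origcol == "PST"))
# label each row with the number of the record it belongs to
dat[, grp := unlist(lapply(1:(length(pst_index) - 1),
                           function(x) rep(x, times = pst_index[x + 1] - pst_index[x])))]
# the factor levels define the sort order within a record: PMID first, PST last
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", "MH",
                                            "RN", "OT", "DEP", "PST"))]
dat[order(grp, Origcol)]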

Result:

    Origcol old_order grp
 1:    PMID         1   1
 2:     LID         2   1
 3:    STAT         3   1
 4:      MH         4   1
 5:      RN         5   1
 6:      OT         6   1
 7:     PST         7   1
 8:    PMID        11   2
 9:     LID         8   2
10:    STAT         9   2
11:      MH        10   2
12:      OT        12   2
13:     PST        13   2
14:    PMID        17   3
15:     LID        14   3
16:      RN        16   3
17:     DEP        15   3
18:     PST        18   3
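The grouping step can probably also be written without lapply and rep. The following is only a sketch in base R on the 18-row example: it builds the record number with cumsum and checks that it matches the loop-based version. The names tags, grp_loop and grp_vec are mine, not from the original code.

```r
# The 18 example tags from the question
tags <- c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
          "LID", "STAT", "MH", "PMID", "OT", "PST",
          "LID", "DEP", "RN", "PMID", "PST")

# Loop-based record number, as in the answer
pst_index <- c(0, which(tags == "PST"))
grp_loop <- unlist(lapply(1:(length(pst_index) - 1),
                          function(x) rep(x, times = pst_index[x + 1] - pst_index[x])))

# Vectorised version: count the PST rows seen so far.
# Subtracting is_pst keeps each PST row in the record it closes.
is_pst  <- tags == "PST"
grp_vec <- cumsum(is_pst) - is_pst + 1L

identical(grp_loop, grp_vec)  # both label the rows the same way
tabulate(grp_vec)             # number of rows in each record
```

One difference worth knowing: if the data did not end with a PST row, the loop-based version would return fewer labels than there are rows, whereas the cumsum version would put the trailing rows into one extra record.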

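The advantage of this approach is that data.table performs many operations by reference, so it should stay fast once you scale up. You said you have 14 million rows, so let's try it. Generate some synthetic data of that size: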

dat_big <- data.table(Origcol = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST"))
dat_big_add <- rbindlist(lapply(1:10000,
                                function(x) data.table(Origcol = c(sample(c("PMID", "LID", "STAT",
                                                                            "MH", "RN", "OT")),
                                                                   "PST"))))
dat_big <- rbindlist(list(dat_big,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add))

dat <- rbindlist(list(dat_big, dat_big, dat_big, dat_big, dat_big,
                      dat_big, dat_big, dat_big, dat_big, dat_big))

We now have:

          Origcol
       1:    PMID
       2:     LID
       3:    STAT
       4:      MH
       5:      RN
      ---
14000066:    STAT
14000067:      MH
14000068:      OT
14000069:    PMID
14000070:     PST
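As a quick sanity check, the row count can be worked out from how the synthetic data was stacked. This is just arithmetic on the construction above; the variable names are my own.

```r
rows_per_record <- 7L      # PMID, LID, STAT, MH, RN, OT, PST
n_random        <- 10000L  # shuffled records in dat_big_add
n_copies        <- 20L     # copies of dat_big_add inside dat_big
n_stack         <- 10L     # copies of dat_big inside dat

# one ordered record plus 20 x 10000 shuffled records
rows_dat_big <- rows_per_record + n_copies * n_random * rows_per_record
total_rows   <- n_stack * rows_dat_big

# every record ends in exactly one PST, so records = PST rows
total_records <- n_stack * (1L + n_copies * n_random)

total_rows     # 14000070
total_records  # 2000010
```

The first number agrees with the last row index printed above, and the second is the number of groups the grouping step should produce.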

Applying the same code as above:

dat[, old_order := 1:.N]
pst_index <- c(0, which(dat$Origcol == "PST"))
dat[, grp := unlist(lapply(1:(length(pst_index) - 1),
                           function(x) rep(x, times = pst_index[x + 1] - pst_index[x])))]
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", "MH",
                                            "RN", "OT", "DEP", "PST"))]
dat[order(grp, Origcol)]

Now we get:

          Origcol old_order     grp
       1:    PMID         1       1
       2:     LID         2       1
       3:    STAT         3       1
       4:      MH         4       1
       5:      RN         5       1
      ---
14000066:    STAT  14000066 2000010
14000067:      MH  14000067 2000010
14000068:      RN  14000064 2000010
14000069:      OT  14000068 2000010
14000070:     PST  14000070 2000010

How long does it take?

library(microbenchmark)
microbenchmark(
    "data.table" = {
        dat[, old_order := 1:.N]
        pst_index <- c(0, which(dat$Origcol == "PST"))
        dat[, grp := unlist(lapply(1:(length(pst_index) - 1),
                                   function(x) rep(x, times = pst_index[x + 1] - pst_index[x])))]
        dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", "MH",
                                                    "RN", "OT", "DEP", "PST"))]
        dat[order(grp, Origcol)]
    },
    times = 10)

And it takes:

Unit: seconds
       expr      min       lq     mean  median       uq      max neval
 data.table 5.755276 5.813267 6.059665 5.87151 6.034506 7.310169    10

Under 10 seconds for 14 million rows. Generating the test data took a long time.

Source

2017-07-06 19:04:27

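Wow, thank you for such a thorough answer! Your explanation looks very promising. However, when running the 'test[, grp := unlist....' line, I ran into this error: 'Error in rep(x, times = (pst_index[x + 1] - pst_index[x])) : invalid 'times' argument' – sweetmusicality

Is there anything different about the 'test' dataset? Does my code fail on your machine when copied as-is? –

Oh, your code works perfectly when copied as-is. My actual dataset has some rows that start the same way (you'll see what I mean in the next sentence) and are consecutive. Also, each row is more than a single word, for example "PMID - 234254" or "MH - Humans", but I don't see why that would cause the error. After seeing your code, I used 'setDT(df)' to convert the data frame to a data.table ... is that the appropriate thing to do? – sweetmusicality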
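One possible explanation for that error, which cannot be confirmed without the real data: if every row looks like "PMID - 234254", then no row is exactly equal to "PST", so pst_index is just 0, 1:(length(pst_index) - 1) becomes 1:0, and rep() receives an NA for times. A sketch of a workaround is to extract the tag before the dash first. The sample values below are made up apart from the two quoted in the comment, and the names lines, tag and grp are my own.

```r
# Hypothetical rows in the "TAG - value" shape described in the comment
lines <- c("PMID - 234254", "MH - Humans", "MH - Animals", "PST - ppublish",
           "MH - Humans", "PMID - 111111", "PST - ppublish")

# Exact matching finds no PST rows at all, which is what breaks rep()
n_match <- sum(lines == "PST")

# Keep only the text before the first dash and trim the spaces around it
tag <- trimws(sub("-.*$", "", lines))

# Group on the extracted tag instead of the full line
is_pst <- tag == "PST"
grp    <- cumsum(is_pst) - is_pst + 1L

data.frame(lines, tag, grp)
```

If this is indeed the cause, the factor levels in the answer would also need to be applied to the extracted tag column rather than to the full text.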

Here is an indexing approach using which.

# get positions of PST, the final value (compare the V1 column only)
endSpot <- which(temp$V1 == "PST")
# increment to get the desired positions of the PMID
# (dropping final value as we don't need to change it)
startSpot <- head(endSpot + 1, -1)
# get the current positions of the PMID, except the first one
PMIDSpot <- tail(which(temp$V1 == "PMID"), -1)

Now use these indices to swap the rows:

temp[c(startSpot, PMIDSpot), ] <- temp[c(PMIDSpot, startSpot), ]

This returns the following (I added a variable called count that holds the original row position, to keep track).

temp
     V1 count
1  PMID     1
2   LID     2
3  STAT     3
4    MH     4
5    RN     5
6    OT     6
7   PST     7
8  PMID    11
9  STAT     9
10   MH    10
11  LID     8
12   OT    12
13  PST    13
14 PMID    17
15  DEP    15
16   RN    16
17  LID    14
18  PST    18

Data

temp <- 
structure(list(V1 = c("PMID", "LID", "STAT", "MH", "RN", "OT", 
"PST", "LID", "STAT", "MH", "PMID", "OT", "PST", "LID", "DEP", 
"RN", "PMID", "PST"), count = 1:18), .Names = c("V1", "count" 
), row.names = c(NA, -18L), class = "data.frame")
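Note that swapping moves the displaced entry (LID in rows 11 and 17 above) into the PMID's old slot, so the other entries do not keep their original order. If that order matters, a possible alternative is to sort by record number and then by "is not a PMID", which only lifts the PMID to the front. This is a sketch in base R, not part of the original answer; grp, ord and result are names I made up.

```r
temp <- data.frame(
  V1 = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
         "LID", "STAT", "MH", "PMID", "OT", "PST",
         "LID", "DEP", "RN", "PMID", "PST"),
  count = 1:18,
  stringsAsFactors = FALSE
)

# Record number: how many PST rows precede this row, plus one
is_pst <- temp$V1 == "PST"
grp    <- cumsum(is_pst) - is_pst + 1L

# Within each record, FALSE sorts before TRUE, so the PMID row comes first;
# the row index as the last key keeps every other row in its original order
ord    <- order(grp, temp$V1 != "PMID", seq_len(nrow(temp)))
result <- temp[ord, ]
rownames(result) <- NULL
result
```

On the example data this should reproduce the layout requested in the question, with DEP still ahead of RN in the last record.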

Source

2017-07-06 18:38:12 lmo
