Why is this R code so slow?

I want to create a data frame based on information in another data frame.

The first data frame (base_mar_bop) has data like:

201301|ABC|4 
201302|DEF|12

I want to create a data frame with 16 rows from it:

4 times: 201301|ABC|1 
12 times: 201302|DEF|1

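I have written a script that takes ages to run. To give an idea, the final data frame has around 2 million rows and the source data frame has around 10k rows. I cannot post the source files for the data frame because the data is confidential.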

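Since this code took ages to run, I decided to do it in PHP instead. That ran in under a minute and got the job done: it wrote the result to a txt file, which I then imported into R.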

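I have no idea why R takes so long. Is it the function call? Is it the nested for loop? From my point of view there are not that many computationally intensive steps in there.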

# first create an empty dataframe called base_eop that will hold each subscriber on a row,
# identified by CED, RATEPLAN and 1
# where 1 is the count and the sum of 1 should end up with the base 
base_eop <-base_mar_bop[1,] 

# let's give some logical names to the columns in the df 
names(base_eop) <- c('CED','RATEPLAN','BASE') 


# define the function that enables us to insert a row at the bottom of the dataframe 
insertRow <- function(existingDF, newrow, r) { 
    existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),] 
    existingDF[r,] <- newrow 
    existingDF 
} 


# now loop through the eop base for march, each row contains the ced, rateplan and number of subs 
# we need to insert a row for each individual sub 
for (i in 1:nrow(base_mar_eop)) { 
    # we go through every row in the dataframe 
    for (j in 1:base_mar_eop[i,3]) { 
    # we insert a row for each CED, rateplan combination and set the base value to 1 
    base_eop <- insertRow(base_eop,c(base_mar_eop[i,1:2],1),nrow(base_eop)) 
    } 
} 

# since the dataframe was created using the first row of base_mar_bop we need to remove this first row 
base_eop <- base_eop[-1,]


2013-04-24 Geoffrey Stoel

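You would be much better off defining the whole data frame in advance and then filling it in, rather than appending rows. I think this is discussed in Pat Burns's "The R Inferno". Also consider using the 'data.table' package for large operations like this. – 2013-04-24 21:45:36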
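A minimal sketch of the preallocation idea from the comment above, assuming a tiny stand-in data frame (the name `src` and its values are hypothetical, modelled on the two example rows in the question). The result vectors are sized once and filled in place, and the data frame is assembled only at the end, so nothing grows inside the loop:

```r
# Hypothetical stand-in for the asker's source data (not the real data).
src <- data.frame(CED = c(201301, 201302),
                  RATEPLAN = c("ABC", "DEF"),
                  BASE = c(4, 12),
                  stringsAsFactors = FALSE)

# The total number of output rows is known up front, so allocate once.
n <- sum(src$BASE)
ced <- numeric(n)
rateplan <- character(n)

# Fill one contiguous block of positions per source row.
pos <- 1
for (i in seq_len(nrow(src))) {
  k <- src$BASE[i]
  if (k > 0) {
    rows <- pos:(pos + k - 1)
    ced[rows] <- src$CED[i]
    rateplan[rows] <- src$RATEPLAN[i]
    pos <- pos + k
  }
}

# Build the data frame a single time, after the loop.
base_eop_sketch <- data.frame(CED = ced, RATEPLAN = rateplan, BASE = 1,
                              stringsAsFactors = FALSE)
```

This still loops over the source rows, so it is meant only to illustrate preallocation, not to be the fastest option.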

Please provide a small (really small, so that you can put it in the code above) reproducible example data set. – eddi 2013-04-24 21:45:42

Should the second line in the output example be '201302|DEF|1' (i.e. 1 instead of 12)? – 2013-04-24 21:46:42

I haven't tried any benchmarks yet, but this approach (shown on your small example) should be much faster:

d <- data.frame(x1=c(201301,201302),x2=c("ABC","DEF"),rep=c(4,12)) 
with(d,data.frame(x1=rep(x1,rep),x2=rep(x2,rep),rep=1))

A slightly more realistic example, with timings:

d2 <- data.frame(CED=1:10000,RATEPLAN=rep(LETTERS[1:25], 
     length.out=10000),BASE=200) 
nrow(d2) ## 10000 
sum(d2$BASE) ## 2e+06 
system.time(d3 <- with(d2, 
     data.frame(CED=rep(CED,BASE),RATEPLAN=rep(RATEPLAN,BASE), 
       BASE=1))) 
## user system elapsed 
## 0.244 0.860 1.117 
nrow(d3) ## 2000000 (== 2e+06)
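Both snippets above rely on `rep()` accepting a vector of counts as its second argument, in which case each element is repeated by the count at the same position. A small illustration, assuming made-up vectors `x` and `times` that mirror the question's two rows:

```r
# Hypothetical inputs mirroring the question's example rows.
x <- c("ABC", "DEF")
times <- c(4, 12)

# With a vector of counts, element i of x is repeated times[i] times.
expanded <- rep(x, times)

length(expanded)   # total equals sum(times)
table(expanded)    # per-value counts match the entries of 'times'
```

Because the whole expansion happens in one vectorised call per column, there is no row-by-row growth of the data frame.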


2013-04-24 21:49:32

I'll try it out tomorrow on my source files. – 2013-04-24 21:54:03

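+1, but I'd love to see someone's data.table solution. I still can't get my head around the syntax. – 2013-04-24 22:19:36

+1. Awesome speed boost. @SimonO101, I'm not sure that 'data.table' would be faster than Ben's approach. It wasn't in my attempts at an answer, but I may not be using 'data.table' very efficiently. The syntax is more compact, though, and when you're talking about whether you have to wait 100 milliseconds or less, I think discussions about speed are a bit silly :) – A5C1D2H2I1M1N2O1R2T1 2013-04-25 04:35:55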

Here is one data.table approach, although @BenBolker's timings are already awesome.

library(data.table) 
DT <- data.table(d2) ## d2 from @BenBolker's answer 
out <- DT[, ID:=1:.N][rep(ID, BASE)][, `:=`(BASE=1, ID=NULL)] 
out 
#            CED RATEPLAN BASE
#       1:     1        A    1
#       2:     1        A    1
#       3:     1        A    1
#       4:     1        A    1
#       5:     1        A    1
#      ---
# 1999996: 10000        Y    1
# 1999997: 10000        Y    1
# 1999998: 10000        Y    1
# 1999999: 10000        Y    1
# 2000000: 10000        Y    1

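Here, I've used compound queries to do the following:

Create an ID variable, which is really just 1 to the number of rows in the data.table.
Use rep to repeat the ID variable by the corresponding BASE value.
Replace all BASE values with "1" and delete the ID variable we created earlier.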
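The three steps above can be unrolled into separate statements, which may make the compound query easier to follow. This is only a sketch on a hypothetical two-row table (`DT_small` and `idx` are names introduced here for illustration), and it assumes the data.table package is installed:

```r
library(data.table)

# Hypothetical two-row table modelled on the question's example.
DT_small <- data.table(CED = c(201301, 201302),
                       RATEPLAN = c("ABC", "DEF"),
                       BASE = c(4, 12))

# Step 1: add an ID column holding the row number (modified by reference).
DT_small[, ID := seq_len(.N)]

# Step 2: repeat each ID by its BASE count, then subset rows by that index.
idx <- DT_small[, rep(ID, BASE)]
out_small <- DT_small[idx]

# Step 3: set BASE to 1 and drop the helper ID column, both by reference.
out_small[, `:=`(BASE = 1, ID = NULL)]
```

Each bracketed call in the compound form corresponds to one of these statements; chaining them simply avoids naming the intermediate results.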

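Perhaps there's a more efficient way to do this, though. For example, dropping one of the compound queries should make it a little faster. Perhaps something like this: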

out <- DT[rep(1:nrow(DT), BASE)][, BASE:=1]
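For comparison, the same row-index idea can be sketched in base R without data.table. This is an illustration on a hypothetical two-row data frame (`d_small` is a name introduced here), not a benchmarked alternative:

```r
# Hypothetical two-row data frame modelled on the question's example.
d_small <- data.frame(CED = c(201301, 201302),
                      RATEPLAN = c("ABC", "DEF"),
                      BASE = c(4, 12),
                      stringsAsFactors = FALSE)

# Repeat each row number by its BASE count and subset rows with that index.
expanded_rows <- d_small[rep(seq_len(nrow(d_small)), d_small$BASE), ]

# Every expanded row represents a single subscriber.
expanded_rows$BASE <- 1

# Row subsetting leaves names like "1.1", "1.2"; reset them to 1..n.
rownames(expanded_rows) <- NULL
```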


2013-04-25 04:27:40 A5C1D2H2I1M1N2O1R2T1

Another way to skin the cat, which seems to be slightly faster: 'DT[DT[, rep(.I, BASE)]][, BASE := 1]' – eddi 2013-04-25 04:54:27

@eddi, nice idea. Basically a 'data.table' merge... – A5C1D2H2I1M1N2O1R2T1 2013-04-25 05:44:15

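+1. My motivation for wanting to see a data.table solution was not speed, but trying to understand how the syntax works. Ben's approach is much more intuitive to me, but compound queries seem *very* powerful, if only I could *get it*! – 2013-04-25 05:57:19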
