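A faster way to find matches and append to a data.frame?

I have code that works, but it is slow, and I would like to speed it up so that I can scale to datasets with several hundred thousand observations.

I have two data frames, one of which I converted to a data.table with the data.table package for fast lookups and joins. Whenever 3 fields of a record in one dataset match records in the second dataset, I want to pick up those matching records.

The two objects are Original.df (a data frame) and LookHereForMatches.dt (a data.table keyed on a1, a2, a3). Original.df will have 100,000 to 300,000 observations, and LookHereForMatches.dt will probably have about twice as many.

I loop over every observation in Original.df and look up the observations in LookHereForMatches.dt that match certain criteria. I need several fields from LookHereForMatches.dt and several fields from Original.df. I use subset() to get the columns I want.

Maybe someone can tell me which part of my code is the worst/slowest. I have to believe it is the rbind(cbind()) part. It does not seem like the right way to do this.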

matched_data.df <- data.frame()
for (i in 1:nrow(Original.df)) {
    # Key values for the current row only (note the [i] index)
    a1 <- Original.df$col1[i]
    a2 <- Original.df$col2[i]
    a3 <- Original.df$col3[i]
    # Use data.table library "join" functionality to get matches
    # (will find at least 1 and up to 4 matches, usually only 1 or 2)
    match.df <- data.frame(LookHereForMatches.dt[J(a1, a2, a3)], stringsAsFactors=FALSE)

    # combine matches with original data and add to data.frame to create big list of data with matches
    matched_data.df <- rbind(cbind(match.df, Original.df[i,], stringsAsFactors=FALSE), matched_data.df)
}
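On the suspicion about rbind(cbind()): growing a data.frame inside a loop copies everything accumulated so far on every iteration, so the cost rises much faster than the row count. The sketch below is one hedged way to keep a loop but bind only once at the end. The toy tables, the name Lookup.df, and the two-column key are assumptions made for illustration, not the asker's real data.

```r
# Toy stand-ins for the question's tables (values are made up for illustration).
Original.df <- data.frame(a1 = c(123, 124, 125),
                          a2 = c("abc", "abc", "bcd"),
                          stringsAsFactors = FALSE)
Lookup.df <- data.frame(a1 = c(123, 123, 125),
                        a2 = c("abc", "abc", "bcd"),
                        Status_Ind = c(0, 1, 0),
                        stringsAsFactors = FALSE)

# Preallocate a list with one slot per original row instead of growing a data.frame.
pieces <- vector("list", nrow(Original.df))
for (i in seq_len(nrow(Original.df))) {
  # Logical vector marking the lookup rows whose keys equal row i's keys.
  hit <- Lookup.df$a1 == Original.df$a1[i] & Lookup.df$a2 == Original.df$a2[i]
  if (any(hit)) {
    # Repeat original row i once per match, then attach the lookup column.
    orig_rows <- Original.df[rep(i, sum(hit)), , drop = FALSE]
    pieces[[i]] <- cbind(orig_rows, Lookup.df[hit, "Status_Ind", drop = FALSE])
  }
}

# Drop the empty slots (rows with no match) and bind everything in one call.
matched <- do.call(rbind, Filter(Negate(is.null), pieces))
print(matched)
```

This still does one lookup per row, so it only removes the repeated-copy cost; a single vectorised join or merge avoids the loop entirely.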

UPDATE

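Here is roughly what the data looks like. (Obviously new to R and to StackExchange. ~~I will figure out how to make the tables prettier and come back to fix this.~~ Thanks to @joran for fixing my tables.) The tables are very basic. I just want to look up each row of the first table and match it with all the appropriate rows of the second table on a1, a2 and a3. In this example, the first row of Original.df should pair with rows 1, 2 and 3 of the LookHereForMatches.dt table, returning 3 rows.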

Original.df <- read.table(textConnection(' 
a1 a2 a3 text.field numeric.field 
123 abc 2011-12-01 "some text" 1.0 
124 abc 2011-11-12 "some other text" 0.1 
125 bcd 2011-12-01 "more text" 1.2 
'), header=TRUE) 

LookHereForMatches.df <- read.table(textConnection(' 
a1 a2 a3 text.field numeric.field Status_Ind 
123 abc 2011-12-01 "some text" 10.5 0 
123 abc 2011-12-01 "different text" 0.1 1 
123 abc 2011-12-01 "more text" 0.1 1 
125 bcd 2011-12-01 "other text" 4.3 0 
125 bcd 2011-12-01 "text"  2.2 0 
'), header=TRUE) 

LookHereForMatches.dt <- data.table(LookHereForMatches.df, key=c("a1","a2","a3"))

Source

2012-04-23 user791770

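Since I don't know what your data looks like, forgive me if this doesn't help... you will get better answers if you can provide a small sample of the data. But can't you use something like a conditional match? `Original.df[Original.df$a1 %in% LookHereForMatches.dt$a1 & Original.df$a2 %in% LookHereForMatches.dt$a2, ]`. A for loop is slow, but rbind(cbind(...)) is much slower. Ideally, you would allocate matched_data.df at its full size before assigning into it. If you can't, something like what I wrote above should help some... – Justin 2012-04-23 22:13:49

I don't understand (perhaps because you haven't provided a reproducible example?) why you can't simply do a join between data.tables. – joran 2012-04-23 22:33:41

Updated with some sample data. I will look into %in%. As for why I can't do a join between data.tables... I am new to R. I will look into joins as well. – user791770 2012-04-23 22:38:48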
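The join suggested in the comments can be sketched as a single keyed data.table join. This is a minimal illustration, assuming both tables are data.tables and that the data.table package is installed; Original.dt, Lookup.dt, and their trimmed columns are stand-ins built from the question's sample data, not the asker's actual objects.

```r
library(data.table)

# Small stand-ins built from the sample data in the question.
Original.dt <- data.table(
  a1 = c(123L, 124L, 125L),
  a2 = c("abc", "abc", "bcd"),
  a3 = c("2011-12-01", "2011-11-12", "2011-12-01"),
  numeric.field = c(1.0, 0.1, 1.2)
)
Lookup.dt <- data.table(
  a1 = c(123L, 123L, 123L, 125L, 125L),
  a2 = c("abc", "abc", "abc", "bcd", "bcd"),
  a3 = rep("2011-12-01", 5),
  Status_Ind = c(0L, 1L, 1L, 0L, 0L)
)
setkey(Lookup.dt, a1, a2, a3)

# For every row of Original.dt, pull all rows of Lookup.dt with the same
# (a1, a2, a3); nomatch = 0L drops original rows that have no match.
joined <- Lookup.dt[Original.dt, on = c("a1", "a2", "a3"), nomatch = 0L]
print(joined)
```

The whole lookup happens in one vectorised call, so there is no per-row loop and no repeated rbind.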

It sounds like merge will do what you want; see ?merge for details.

> merge(Original.df, LookHereForMatches.df, by=c("a1","a2","a3"))
   a1  a2         a3 text.field.x numeric.field.x   text.field.y
1 123 abc 2011-12-01    some text             1.0      some text
2 123 abc 2011-12-01    some text             1.0 different text
3 123 abc 2011-12-01    some text             1.0      more text
4 125 bcd 2011-12-01    more text             1.2     other text
5 125 bcd 2011-12-01    more text             1.2           text
  numeric.field.y Status_Ind
1            10.5          0
2             0.1          1
3             0.1          1
4             4.3          0
5             2.2          0

If you want more control, merge uses match behind the scenes, something like this:

a <- with(Original.df, paste(a1, a2, a3, sep="\b")) 
b <- with(LookHereForMatches.df, paste(a1, a2, a3, sep="\b")) 
m <- match(b, a) 
cbind(Original.df[m,], LookHereForMatches.df)
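To see why the cbind lines up correctly, it may help to look at what m actually holds. In the sketch below (trimmed copies of the question's sample data, with a dropped text column as a simplifying assumption), match(b, a) returns, for each lookup row, the position of the first row in Original.df with the same pasted key, or NA if there is none. Indexing Original.df by m therefore produces one original row per lookup row, in lookup order. This relies on the keys being unique in Original.df.

```r
# Trimmed copies of the question's sample tables.
Original.df <- data.frame(
  a1 = c(123, 124, 125),
  a2 = c("abc", "abc", "bcd"),
  a3 = c("2011-12-01", "2011-11-12", "2011-12-01"),
  stringsAsFactors = FALSE
)
LookHereForMatches.df <- data.frame(
  a1 = c(123, 123, 123, 125, 125),
  a2 = c("abc", "abc", "abc", "bcd", "bcd"),
  a3 = rep("2011-12-01", 5),
  Status_Ind = c(0, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Collapse the three key columns into one string per row.
a <- with(Original.df, paste(a1, a2, a3, sep = "\b"))
b <- with(LookHereForMatches.df, paste(a1, a2, a3, sep = "\b"))

# m[k] is the row of Original.df that lookup row k belongs to.
m <- match(b, a)
print(m)  # 1 1 1 3 3

# Keep only lookup rows that found a partner, then glue the two sides together.
keep <- !is.na(m)
result <- cbind(Original.df[m[keep], ],
                LookHereForMatches.df[keep, "Status_Ind", drop = FALSE])
print(result)
```

Row 2 of Original.df (a1 = 124) never appears in m, which is why it is absent from the result, the same behaviour as the default merge.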

Also look at the all options (all, all.x, all.y) to control what happens when a key does not appear in both datasets.

merge(Original.df, LookHereForMatches.df, by=c("a1","a2","a3"), all=TRUE)
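As a small illustration of those options, the sketch below compares the default merge with all.x=TRUE, which keeps unmatched rows of the first table and fills the missing columns with NA. It also uses merge's suffixes argument to name columns that exist in both tables. The trimmed toy tables and the suffix strings "_orig" and "_lookup" are arbitrary choices for this example.

```r
# Trimmed toy versions of the two tables; numeric.field exists in both.
Original.df <- data.frame(
  a1 = c(123, 124, 125),
  a2 = c("abc", "abc", "bcd"),
  numeric.field = c(1.0, 0.1, 1.2),
  stringsAsFactors = FALSE
)
LookHereForMatches.df <- data.frame(
  a1 = c(123, 123, 125),
  a2 = c("abc", "abc", "bcd"),
  numeric.field = c(10.5, 0.1, 4.3),
  Status_Ind = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Default: only keys present in both tables survive (row 124 is dropped).
inner <- merge(Original.df, LookHereForMatches.df, by = c("a1", "a2"))

# all.x = TRUE keeps every row of Original.df; unmatched rows get NA.
left <- merge(Original.df, LookHereForMatches.df, by = c("a1", "a2"),
              all.x = TRUE, suffixes = c("_orig", "_lookup"))

print(inner)
print(left)
```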

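As for speed on large datasets, you could get some speedup by using data.table, but with 1e5 and 3e5 rows respectively (as below), on my system the merge takes only 2.6 seconds and the match approach only 1.5 seconds.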

set.seed(5) 
N <- 1e5 
Original.df <- data.frame(a1=1:N, a2=1, a3=1, text1=paste("hi",1:N)) 
LookHereForMatches.df <- data.frame(a1=sample(1:N, 3*N, replace=TRUE), 
            a2=1, a3=1, text2=paste("hi", 1:(3*N)))
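The timing calls themselves are not shown above. A hedged sketch of how such a comparison could be run with system.time follows. N is reduced to 1e4 here (an assumption, so that the example finishes quickly), and the resulting numbers depend on the machine, so none are quoted.

```r
set.seed(5)
N <- 1e4  # smaller than the 1e5 used in the answer, to keep the sketch fast
Original.df <- data.frame(a1 = 1:N, a2 = 1, a3 = 1, text1 = paste("hi", 1:N))
LookHereForMatches.df <- data.frame(a1 = sample(1:N, 3 * N, replace = TRUE),
                                    a2 = 1, a3 = 1,
                                    text2 = paste("hi", 1:(3 * N)))

# Time the merge approach.
t_merge <- system.time(
  merged <- merge(Original.df, LookHereForMatches.df, by = c("a1", "a2", "a3"))
)

# Time the paste + match approach.
t_match <- system.time({
  a <- with(Original.df, paste(a1, a2, a3, sep = "\b"))
  b <- with(LookHereForMatches.df, paste(a1, a2, a3, sep = "\b"))
  m <- match(b, a)
  matched <- cbind(Original.df[m, ], LookHereForMatches.df["text2"])
})

# Elapsed wall-clock seconds for each approach.
print(c(merge = t_merge[["elapsed"]], match = t_match[["elapsed"]]))
```

Every a1 in the lookup table is drawn from 1:N, so both approaches return one row per lookup row.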

Source

2012-04-24 01:44:03 Aaron

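Thanks. That's it. suffixes=c("_first","_second"), or something close to it, also helps with the naming. Still trying %in%, but this seems to do the trick. I will post some timing details once I have tested it and have it working properly. – user791770 2012-04-24 02:52:30

I am still not entirely clear on how cbind matches the rows up correctly, but since merge() gets me where I need to go, I will stick with it. Timings: with 1,000 rows in Original.df and 3,000 rows in LookHereForMatches.df, merge(): 0.015, for loop: 3.1. With 10,000 rows and 30,000 rows, merge(): 0.25, for loop: 74 seconds. – user791770 2012-04-24 03:31:47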
