Compare two matrices to find which rows were deleted, in R

I have two matrices, one of which was generated from the other by deleting some rows. For example:

m = matrix(1:18, 6, 3) 
m1 = m[c(-1, -3, -6),]

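Suppose I do not know which rows of m were removed to create m1. How can I find them by comparing the two matrices? The result I want looks like this: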

1, 3, 6

The actual matrices I am working with are very large, so I would like to know whether there is an efficient way to do this.

Source

2017-06-07 user7453767

One possible way is to represent each row as a string:

x1 <- apply(m, 1, paste0, collapse = ';') 
x2 <- apply(m1, 1, paste0, collapse = ';') 
which(!x1 %in% x2) 
# [1] 1 3 6
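
The same idea can be wrapped in a small helper for reuse. This is only a sketch: the function name `removed_rows` and its `sep` argument are illustrative and not part of the answer, and it assumes the separator never occurs inside a value and that both inputs are matrices with the same columns.

```r
# Hypothetical helper: indices of the rows of m that have no equal row in m1.
removed_rows <- function(m, m1, sep = ";") {
  key  <- apply(m,  1, paste0, collapse = sep)   # one string key per row of m
  key1 <- apply(m1, 1, paste0, collapse = sep)   # one string key per row of m1
  which(!key %in% key1)
}

m  <- matrix(1:18, 6, 3)
m1 <- m[c(-1, -3, -6), ]
removed_rows(m, m1)
# [1] 1 3 6
```

Because `%in%` tests each row of m independently, duplicated rows of m that are missing from m1 are all reported, not just the first occurrence.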

Some benchmarks on a large matrix, comparing my solution with G. Grothendieck's solutions:

set.seed(123) 
m <- matrix(rnorm(20000 * 5000), nrow = 20000) 
m1 <- m[-sample.int(20000, 1000), ] 

system.time({ 
    which(tail(!duplicated(rbind(m1, m)), nrow(m))) 
}) 
# user system elapsed 
# 339.888 2.368 342.204 
system.time({ 
    x1 <- apply(m, 1, paste0, collapse = ';') 
    x2 <- apply(m1, 1, paste0, collapse = ';') 
    which(!x1 %in% x2) 
}) 
# user system elapsed 
# 395.428 0.568 395.955 

system.time({ 
    n <- nrow(m); n1 <- nrow(m1) 
    tm <- t(m); tm1 <- t(m1) 

    # a row matches when all ncol(m) elements agree (ncol(m) == nrow(tm))
    match_indexes <- function(i) which(colSums(tm1[, i] == tm) == nrow(tm)) 
    setdiff(1:n, unlist(lapply(1:n1, match_indexes))) 
}) 
# > 15 min, did not finish 


system.time({ 
    i <- interaction(as.data.frame(m)) 
    i1 <- interaction(as.data.frame(m1)) 
    match(setdiff(i, i1), i) 
}) 
# ran out of memory; my machine with 32 GB of RAM crashed.
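
To try the comparison without waiting several minutes, the benchmark can be scaled down. The sketch below is not the run reported above: the sizes (2000 x 50, 100 rows dropped) are assumptions chosen so that it finishes quickly, and it only times the two approaches that completed. No timings are shown because they depend on the machine.

```r
set.seed(123)
n <- 2000; p <- 50                       # assumed sizes, far smaller than the original run
m   <- matrix(rnorm(n * p), nrow = n)
idx <- sort(sample.int(n, 100))          # indices of the rows to drop
m1  <- m[-idx, ]

t_dup <- system.time(
  r_dup <- which(tail(!duplicated(rbind(m1, m)), nrow(m)))
)
t_key <- system.time({
  x1 <- apply(m, 1, paste0, collapse = ";")
  x2 <- apply(m1, 1, paste0, collapse = ";")
  r_key <- which(!x1 %in% x2)
})

# Both approaches should recover exactly the dropped rows.
stopifnot(identical(r_dup, idx), identical(r_key, idx))

# Elapsed seconds for each approach.
c(duplicated = t_dup[["elapsed"]], paste_keys = t_key[["elapsed"]])
```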

Source

2017-06-07 02:39:24 mt1022

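Thank you very much! But my matrix m is actually a document-term matrix with 14290 rows and 4413 terms. Can this method handle a matrix that large? – user7453767

@user7453767, it is very slow on a matrix that large. I made a test example and started running it a few minutes ago. It has not finished yet. – mt1022

This is very helpful! Thank you @mt1022 – user7453767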

We can also use do.call:

which(!do.call(paste, as.data.frame(m)) %in% do.call(paste, as.data.frame(m1))) 
#[1] 1 3 6
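
To see why this works: `do.call(paste, as.data.frame(m))` passes each column of m to `paste` as a separate argument, so the row keys are built in one vectorised call rather than once per row as with `apply`. A short sketch on the example data (the variable names `key` and `key1` are illustrative):

```r
m  <- matrix(1:18, 6, 3)
m1 <- m[c(-1, -3, -6), ]

# Equivalent to paste(m[, 1], m[, 2], m[, 3]): one key per row, built column-wise.
key  <- do.call(paste, as.data.frame(m))
key1 <- do.call(paste, as.data.frame(m1))

key[1:2]
# [1] "1 7 13" "2 8 14"
which(!key %in% key1)
# [1] 1 3 6
```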

Source

2017-06-07 03:56:30 akrun

Here are some approaches:

1) If we assume that there are no duplicated rows in m, which is the case in the question's example, then:

which(tail(!duplicated(rbind(m1, m)), nrow(m))) 
## [1] 1 3 6

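2) Transpose m and m1 to give tm and tm1, since it is more efficient to work on columns than on rows.

Define match_indexes(i), which returns a vector r such that each row of m[r, ] matches m1[i, ].

Apply it to each i in 1:n1 and remove the results from 1:n.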

n <- nrow(m); n1 <- nrow(m1) 
tm <- t(m); tm1 <- t(m1) 

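# a row matches when all of its elements agree, so compare against ncol(m),
# i.e. nrow(tm), rather than n1 (the two only coincide in this 3-row example)
match_indexes <- function(i) which(colSums(tm1[, i] == tm) == nrow(tm)) 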
setdiff(1:n, unlist(lapply(1:n1, match_indexes))) 
## [1] 1 3 6
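
To make the role of `colSums` in approach (2) concrete, here is one step of `match_indexes` worked through on the question's example, for i = 1. The variable `hits` is introduced only for illustration.

```r
m  <- matrix(1:18, 6, 3)
m1 <- m[c(-1, -3, -6), ]
tm <- t(m); tm1 <- t(m1)

# tm1[, 1] is the first row of m1. It is recycled down each column of tm,
# so column j of the comparison tests m1[1, ] against m[j, ] element by element.
hits <- colSums(tm1[, 1] == tm)
hits
# [1] 0 3 0 0 0 0

# A full match means every element agrees, so the count equals the
# number of columns of m (3 here).
which(hits == ncol(m))
# [1] 2
```

So the first row of m1 is row 2 of m, and row 2 is therefore excluded from the result of `setdiff`.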

3) Compute the interaction vector for each matrix, then use setdiff and finally match to get the indexes:

i <- interaction(as.data.frame(m)) 
i1 <- interaction(as.data.frame(m1)) 
match(setdiff(i, i1), i) 
## [1] 1 3 6

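Added: If there can be duplicated rows in m, then (1) and (3) will return only the first of any multiply occurring rows of m that are not in m1, whereas (2) returns all of them.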

m <- matrix(1:18, 6, 3) 
m1 <- m[c(2, 4, 5),] 
m <- rbind(m, m[1:2, ]) 
# 1 
which(tail(!duplicated(rbind(m1, m)), nrow(m))) 
## 1 3 6 

# 2 
n <- nrow(m); n1 <- nrow(m1) 
tm <- t(m); tm1 <- t(m1) 
# compare against ncol(m), i.e. nrow(tm): a match needs every element to agree
match_indexes <- function(i) which(colSums(tm1[, i] == tm) == nrow(tm)) 
setdiff(1:n, unlist(lapply(1:n1, match_indexes))) 
## 1 3 6 7 

# 3 
i <- interaction(as.data.frame(m)) 
i1 <- interaction(as.data.frame(m1)) 
match(setdiff(i, i1), i) 
## 1 3 6
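
For comparison, a key-based lookup with `%in%` tests every row of m independently, so on the same data it behaves like approach (2) and reports every occurrence of a missing row. This is a sketch using the `paste` keys from another answer in this thread, not part of this answer:

```r
m  <- matrix(1:18, 6, 3)
m1 <- m[c(2, 4, 5), ]
m  <- rbind(m, m[1:2, ])   # rows 7 and 8 repeat rows 1 and 2

key  <- do.call(paste, as.data.frame(m))
key1 <- do.call(paste, as.data.frame(m1))
which(!key %in% key1)
# [1] 1 3 6 7
```

Row 8 is not reported because an equal row (row 2) is present in m1. None of these approaches counts how many copies of a row each matrix holds.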

Source

2017-06-07 04:07:59

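I prefer the first one. It is also faster. See the benchmark in my answer. – mt1022

The first one is great. Unfortunately, my matrix itself has duplicated rows. Still, I like the novel approaches you introduced here. Thank you! – user7453767

Actually, we can relax that assumption: it still works if m has duplicated rows. If m has duplicated rows that are not in m1, only the first of each multiply occurring row will be included in the output vector. Is that good enough? –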
