2013-04-25 45 views

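How to "flatten" or "collapse" a 2D data frame into a 1D data frame in R?

I have a two-dimensional table of distances in a data.frame in R (imported from CSV):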

  CP000036 CP001063  CP001368 
CP000036  0   a   b 
CP001063  a   0   c 
CP001368  b   c   0 

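I would like to "flatten" it, so that the first column holds the value from one axis, the second column holds the value from the other axis, and the third column holds the distance: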

Genome1  Genome2  Dist 
CP000036  CP001063  a 
CP000036  CP001368  b 
CP001063  CP001368  c 

The above is ideal, but it would also be perfectly fine to have repetition, so that each cell in the input matrix gets its own row:

Genome1  Genome2  Dist 
CP000036  CP000036  0 
CP000036  CP001063  a 
CP000036  CP001368  b 
CP001063  CP000036  a 
CP001063  CP001063  0 
CP001063  CP001368  c 
CP001368  CP000036  b 
CP001368  CP001063  c 
CP001368  CP001368  0 

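This is an example 3x3 matrix, but my real dataset is much larger (about 2000x2000). I would do this in Excel, but I need about 3 million rows of output, and Excel's maximum is about 1 million.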

This question is very similar to "How to 'flatten' or 'collapse' a 2D Excel table into 1D?"


as.data.frame.table? – 2013-04-25 17:36:43
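
The comment's suggestion can be sketched roughly as follows. This is only an illustration under stated assumptions: the numeric distances 1, 2, 3 stand in for the symbolic a, b, c of the question, and the names `m`, `long_all` and `long_unique` are made up for the example. A data.frame read from CSV would first need `as.matrix()`.

```r
# Toy distance matrix mirroring the 3x3 example in the question.
# The numeric values are assumed stand-ins for a, b and c.
ids <- c("CP000036", "CP001063", "CP001368")
m <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(ids, ids))

# as.table() keeps the dimnames; as.data.frame() on a table
# then produces one row per cell (the "with repetition" form).
long_all <- as.data.frame(as.table(m), responseName = "Dist",
                          stringsAsFactors = FALSE)
names(long_all) <- c("Genome1", "Genome2", "Dist")

# Optional: keep each unordered pair once and drop the diagonal
# (the "ideal" form), by comparing the two identifiers as strings.
long_unique <- long_all[long_all$Genome1 < long_all$Genome2, ]
```

For an n x n input, `long_all` has n^2 rows and `long_unique` has n(n-1)/2 rows.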

Answers


Here is a solution using melt from the reshape2 package:

dm <- 
    data.frame(CP000036 = c("0", "a", "b"), 
       CP001063 = c("a", "0", "c"), 
       CP001368 = c("b", "c", "0"), 
       stringsAsFactors = FALSE, 
       row.names = c("CP000036", "CP001063", "CP001368")) 

# assuming the distance follows a metric we avoid everything below and on the diagonal 
dm[ lower.tri(dm, diag = TRUE) ] <- NA 
dm$Genome1 <- rownames(dm) 

# finally melt and avoid the entries below the diagonal with na.rm = TRUE 
library(reshape2) 
dm.molten <- melt(dm, na.rm= TRUE, id.vars="Genome1", 
        value.name="Dist", variable.name="Genome2") 

print(dm.molten) 
    Genome1 Genome2 Dist 
4 CP000036 CP001063 a 
7 CP000036 CP001368 b 
8 CP001063 CP001368 c 

There may be solutions that perform better, but I like this one because it is simple and straightforward.
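
For a large matrix, one possible alternative that stays in base R is to index the upper triangle directly instead of reshaping. The sketch below is not the approach from the answer above. It assumes the distances are held in a numeric matrix (again with 1, 2, 3 standing in for a, b, c), and the names `m`, `idx` and `flat` are hypothetical. Whether it is actually faster on a 2000x2000 input has not been measured here.

```r
# Toy symmetric distance matrix; values are assumed stand-ins for a, b, c.
ids <- c("CP000036", "CP001063", "CP001368")
m <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(ids, ids))

# Row/column positions of every cell strictly above the diagonal.
idx <- which(upper.tri(m), arr.ind = TRUE)

# Look up the labels and the distance for each selected cell.
flat <- data.frame(Genome1 = rownames(m)[idx[, "row"]],
                   Genome2 = colnames(m)[idx[, "col"]],
                   Dist    = m[idx],
                   stringsAsFactors = FALSE)
```

Using `upper.tri(m, diag = TRUE)` instead would also keep the zero-distance diagonal entries.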