R: aggregate on part of a list

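I have a large data frame (base_cov_norm_compl_taxid3) in which each row represents a genomic region and each column holds the coverage of that region in one sample. Each taxid (roughly, a genome) has several genomic regions, and I want to use aggregate to find the mean, sd, etc. across all genomic regions of the same type.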

base_cov_norm_compl_taxid3[1:10,1:10] 
           geneid_stst attr taxid 
1 1001585.66299.NC_015410_1089905_1090333 mrkg 1001585 
2 1001585.66299.NC_015410_1090348_1090740 mrkg 1001585 
3 1001585.66299.NC_015410_1215751_1216851 mrkg 1001585 
4 1001585.66299.NC_015410_2346036_2347421 mrkg 1001585 
5 1001585.66299.NC_015410_2354962_2429569 PFPR 1001585 
6 1001585.66299.NC_015410_2610633_2611913 mrkg 1001585 
7 1001585.66299.NC_015410_3224232_3225248 mrkg 1001585 
8 1001585.66299.NC_015410_3682375_3683115 mrkg 1001585 
9 1001585.66299.NC_015410_4101816_4103195 mrkg 1001585 
10 1001585.66299.NC_015410_4141587_4142873 mrkg 1001585 
         locus X765560005.stool1 X764224817.stool1  MH0008 
1 NC_015410_1089905_1090333     0     0 0.0000000000 
2 NC_015410_1090348_1090740     0     0 0.0000000000 
3 NC_015410_1215751_1216851     0     0 0.0000000000 
4 NC_015410_2346036_2347421     0     0 0.0281385281 
5 NC_015410_2354962_2429569     0     0 0.0005361355 
6 NC_015410_2610633_2611913     0     0 0.0000000000 
7 NC_015410_3224232_3225248     0     0 0.0000000000 
8 NC_015410_3682375_3683115     0     0 0.0000000000 
9 NC_015410_4101816_4103195     0     0 0.0000000000 
10 NC_015410_4141587_4142873     0     0 0.0000000000 
     V1.CD9.0 X764062976.stool1 X160643649.stool1 
1 0.0000000000     0     0 
2 0.0000000000     0     0 
3 0.0000000000     0     0 
4 0.0000000000     0     0 
5 0.0004557152     0     0 
6 0.0000000000     0     0 
7 0.0000000000     0     0 
8 0.0000000000     0     0 
9 0.0000000000     0     0 
10 0.0000000000     0     0

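Each genome always has several genomic regions of type mrkg, and sometimes several regions of type PFPR as well. I want to aggregate by taxid and attr, but only for attr = mrkg, and I don't know how to do that. The code below aggregates by taxid and attr; should I instead write something like list(base_cov_norm_compl_taxid3$taxid,base_cov_norm_compl_taxid3$attr=mrkg), or take a subset first?

Any help is appreciated,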

base_cov_mean<-aggregate(base_cov_norm_compl_taxid3[,5:266], 
    list(base_cov_norm_compl_taxid3$taxid, 
    base_cov_norm_compl_taxid3$attr),mean)


2012-04-17 user1249760

subdf <- subset(base_cov_norm_compl_taxid3, attr %in% "mrkg") 
base_cov_mean <- with(subdf, aggregate(subdf[5:266], 
            by=list(taxid, attr), 
            FUN=mean) 
         )

I did not use attr == "mrkg" because it does not generalize as well.
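As a small illustration of that remark, the sketch below compares `==` and `%in%` on a toy data frame. The data frame `toy` and its values are hypothetical stand-ins, not the asker's data. Two differences show up: `==` propagates NA while `%in%` always returns TRUE or FALSE, and `%in%` accepts several values at once.

```r
# Toy stand-in for base_cov_norm_compl_taxid3 (hypothetical values)
toy <- data.frame(
  taxid = c(1, 1, 1, 2, 2, 2),
  attr  = c("mrkg", "mrkg", "PFPR", "mrkg", NA, "mrkg"),
  s1    = c(0, 2, 5, 4, 9, 6),
  stringsAsFactors = FALSE
)

# `==` returns NA where attr is NA; `%in%` returns FALSE there
eq_test <- toy$attr == "mrkg"
in_test <- toy$attr %in% "mrkg"

# With plain `[` indexing, the NA from `==` produces an extra all-NA row
rows_eq <- toy[eq_test, ]
rows_in <- toy[in_test, ]

# `%in%` extends directly to more than one type
rows_both <- toy[toy$attr %in% c("mrkg", "PFPR"), ]

# Aggregate only the mrkg rows, taking the grouping vectors from the subset
res <- aggregate(rows_in["s1"],
                 by = list(taxid = rows_in$taxid, attr = rows_in$attr),
                 FUN = mean)
print(res)
```

Note that subset() itself treats NA in the condition as FALSE, so inside subset() the NA difference does not matter; it does matter with `[` indexing.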


2012-04-17 12:19:51

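Thanks, that makes sense, but I get this error: "Error in FUN(X[[1L]], ...) : arguments must have same length" – user1249760 2012-04-17 12:26:10

Sorry, I forgot to shorten the by vectors. See if the fix works. – 2012-04-17 12:27:48

I don't see the fix – user1249760 2012-04-17 12:34:29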
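The pre-edit version of the answer is not shown here, so the following is only a guess at what went wrong, consistent with the reply about shortening the by vectors: if the data come from the subset but the grouping vectors still come from the full data frame, their lengths differ and aggregate stops. The data frame `toy` is a hypothetical stand-in.

```r
# Toy stand-in for base_cov_norm_compl_taxid3 (hypothetical values)
toy <- data.frame(
  taxid = c(1, 1, 1, 2, 2),
  attr  = c("mrkg", "mrkg", "PFPR", "mrkg", "mrkg"),
  s1    = c(0, 2, 5, 4, 6),
  s2    = c(1, 3, 7, 8, 10),
  stringsAsFactors = FALSE
)

subdf <- subset(toy, attr %in% "mrkg")   # 4 of the 5 rows remain

# Mismatch: 4 rows of data, but grouping vectors of length 5.
# tryCatch captures the error message instead of stopping the script.
bad <- tryCatch(
  aggregate(subdf[c("s1", "s2")],
            by = list(toy$taxid, toy$attr),
            FUN = mean),
  error = function(e) conditionMessage(e)
)
print(bad)

# Consistent: data and grouping vectors both come from subdf
good <- aggregate(subdf[c("s1", "s2")],
                  by = list(taxid = subdf$taxid, attr = subdf$attr),
                  FUN = mean)
print(good)
```

Wrapping the call in with(subdf, ...), as the answer does, achieves the same thing because taxid and attr are then looked up in subdf.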

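You could use data.table.

It has an optimized mean, so it will be very fast.
It is also easy to define the subset within the call.
It has all the advantages of @DWin's with solution: you don't need lots of $.

Here is some code as an example: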

library(data.table) 
DT <- data.table(base_cov_norm_compl_taxid3) 
# the columns of which you want the mean 
columns_of_interest <- names(DT)[5:266] 
DT[attr %in% 'mrkg', lapply(.SD, mean), by = list(taxid, attr), .SDcols = columns_of_interest]
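
For comparison, base R can also take the filter inside the call: the formula method of aggregate accepts a subset argument. The sketch below is a minimal illustration on a hypothetical toy data frame (`toy`, `s1`, `s2` are made-up names), computing both mean and sd as asked in the question. With the real data you would need to name or select the 262 sample columns rather than use all columns, and note that the formula method drops rows containing NA by default.

```r
# Toy stand-in for base_cov_norm_compl_taxid3 (hypothetical values)
toy <- data.frame(
  taxid = c(1, 1, 1, 2, 2),
  attr  = c("mrkg", "mrkg", "PFPR", "mrkg", "mrkg"),
  s1    = c(0, 2, 5, 4, 6),
  s2    = c(1, 3, 7, 8, 10),
  stringsAsFactors = FALSE
)

# Mean per taxid, restricted to mrkg rows via the subset argument
means <- aggregate(cbind(s1, s2) ~ taxid + attr, data = toy,
                   subset = attr %in% "mrkg", FUN = mean)

# Standard deviation per taxid for the same rows
sds <- aggregate(cbind(s1, s2) ~ taxid + attr, data = toy,
                 subset = attr %in% "mrkg", FUN = sd)

print(means)
print(sds)
```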


2012-09-10 23:24:55 mnel
