2015-04-05 105 views
3

I have a data frame (called dat) like the one below, and I want to get frequencies from it:

chr  chrStart chrEnd Gene  RChr RStart REnd Rname distance
chr1 39841    39883  Gene1 chr1 398    3984 Cha1b 0
chr1 39841    39883  Gene1 chr1 398    3985 Ab    0
chr1 39841    39883  Gene1 chr1 398    3986 Tia   0
chr1 39841    39883  Gene1 chr1 398    3987 MEA   0
chr1 39841    39883  Gene1 chr1 398    3988 La    0
chr1 39841    39883  Gene1 chr1 398    3989 M3    0
chr1 14893    15893  Gene2 chr1 398    3984 Cha1b 0
chr1 14893    15893  Gene2 chr1 398    3985 Cha1b 0
chr1 14893    15893  Gene2 chr1 398    3986 Cha1b 0
chr1 14893    15893  Gene2 chr1 398    3987 MEA   0
chr1 14893    15893  Gene2 chr1 398    3988 MEA   0
chr1 39841    39883  Gene1 chr1 398    3989 M3    0

What I want is how often each distinct Rname occurs for each gene, so the result for the data above should be a frequency table like:

Gene  Rname Freq
Gene1 Cha1b 1
Gene1 Ab    1
Gene1 Tia   1
Gene1 MEA   1
Gene1 La    1
Gene1 M3    1
Gene2 Cha1b 3
Gene2 MEA   2
Gene2 M3    1

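I tried doing two separate group_by calls with dplyr, but I don't think that makes sense; in any case it just gives me all the Rnames for each gene.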

+0

A base R option is 'subset(as.data.frame(table(dat[c('Gene','Rname')])), Freq != 0)' – akrun 2015-04-05 12:22:09
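
For anyone who wants to try the base R one-liner from the comment above, here is a minimal sketch. The toy data frame is an assumption: it keeps only the Gene and Rname columns from the question and rebuilds them by hand, so the real dat may differ.

```r
# Toy stand-in for the question's data (assumption: only Gene and Rname matter here)
dat <- data.frame(
  Gene  = c(rep("Gene1", 6), rep("Gene2", 5), "Gene1"),
  Rname = c("Cha1b", "Ab", "Tia", "MEA", "La", "M3",
            "Cha1b", "Cha1b", "Cha1b", "MEA", "MEA", "M3")
)

# table() cross-tabulates Gene by Rname, as.data.frame() turns the table into
# long form with a Freq column, and subset() drops pairs that never occur
freq <- subset(as.data.frame(table(dat[c("Gene", "Rname")])), Freq != 0)
print(freq)
```

The Freq != 0 filter is needed because table() fills in every Gene/Rname combination, including those with no rows.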

Answers

3

You should use n() to count the observations (you cannot sum non-numeric values), and you can group by both variables at once.

library(dplyr)

dat %>%
    group_by(Gene, Rname) %>%
    summarise(freq = n())

# Source: local data frame [8 x 3] 
# Groups: Gene 
# 
# Gene Rname freq 
# 1 Gene1 Ab 1 
# 2 Gene1 Cha1b 1 
# 3 Gene1 La 1 
# 4 Gene1 M3 2 
# 5 Gene1 MEA 1 
# 6 Gene1 Tia 1 
# 7 Gene2 Cha1b 3 
# 8 Gene2 MEA 2 

Or use tally, as in:

dat %>% 
    group_by(Gene, Rname) %>% 
    tally() 

Or (as suggested by @hrbrmstr) you can skip the grouping step by using count:

dat %>% 
    count(Gene, Rname) 
+0

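If I wanted to put this into a format with the gene names along the rows and the Rnames along the columns, how would I do that? (Happy to ask a separate question if necessary.) – 2015-04-05 12:29:00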

+1

@user362206 Just use 'table' as in my comment, or you may need 'spread' from 'tidyr' or 'dcast' from 'reshape2' – akrun 2015-04-05 12:31:38
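
A rough sketch of the 'table' route for the wide layout asked about here, using base R only. The toy data frame and the names wide and wide_df are assumptions made for illustration and are not from the thread.

```r
# Toy stand-in for the question's data (assumption: only Gene and Rname matter here)
dat <- data.frame(
  Gene  = c(rep("Gene1", 6), rep("Gene2", 5), "Gene1"),
  Rname = c("Cha1b", "Ab", "Tia", "MEA", "La", "M3",
            "Cha1b", "Cha1b", "Cha1b", "MEA", "MEA", "M3")
)

# A two-way table already has genes along the rows and Rnames along the columns
wide <- table(dat$Gene, dat$Rname)
print(wide)

# Optional: convert to an ordinary data frame, keeping the same row/column layout
wide_df <- as.data.frame.matrix(wide)
print(wide_df)
```

Combinations that never occur show up as 0 in this layout. The tidyr and reshape2 functions mentioned in the comment should give a similar shape when starting from the long count table.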

3

You could try data.table:

library(data.table) 
setDT(dat)[, list(count = .N), by = list(Gene, Rname)]

# Gene Rname count 
#1: Gene1 Cha1b  1 
#2: Gene1 Ab  1 
#3: Gene1 Tia  1 
#4: Gene1 M3  2 
#5: Gene2 Cha1b  3 
#6: Gene2 MEA  2 
#7: Gene1 MEA  1 
#8: Gene1 La  1 
+0

This one also gave me what I wanted, but I decided to go with the one above – 2015-04-05 12:23:50

+1

No problem! If you prefer dplyr, feel free to use it, of course ;) – 2015-04-05 12:25:21