2017-09-15 84 views
4

我有以下數據框:自動計算彙總統計的數據幀和創建新表

col1 <- c("avi","chi","chi","bov","fox","bov","fox","avi","bov", 
      "chi","avi","chi","chi","bov","bov","fox","avi","bov","chi") 
col2 <- c("low","med","high","high","low","low","med","med","med","high", 
      "low","low","high","high","med","med","low","low","med") 
col3 <- c(0,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,0,1,0) 

test_data <- cbind(col1, col2, col3) 
test_data <- as.data.frame(test_data) 

,我想是這樣的表落得(值是隨機的):

Species Pop.density %Resistance CI_low CI_high Total samples 
avi  low   2.0   1.2  2.2  30 
avi  med   0   0  0.5  20 
avi  high   3.5   2.9  4.2  10 
chi  low   0.5   0.3  0.7  20 
chi  med   2.0   1.9  2.1  150 
chi  high   6.5   6.2  6.6  175 

%阻力欄基於上面的col3,其中1 =耐,0 =不耐。我曾嘗試以下:

library(dplyr) 
test_data<-test_data %>% 
    count(col1,col2,col3) %>% 
    group_by(col1, col2) %>% 
    mutate(perc_res = prop.table(n)*100) 

我想這一點,它似乎幾乎做的伎倆,因爲我得到的總的1和0的COL3的百分比,在col1和2每一個值,但總樣本是錯誤的,因爲我指望所有的三列,當正確的計數將是唯一的col1和2

對於置信區間我會用以下內容:

binom.test(resistant samples,total samples)$conf.int*100 

但是我不知道怎麼樣與其他人一起實施。 有沒有簡單快捷的方法來做到這一點?

+0

我建議使用group_by然後使用匯總功能。 – Jul

+0

使用'data.frame(col1,col2,col3)',而不是'cbind',這會迫使每列在這裏串起來。 – Frank

+0

您的示例數據沒有(「avi」,「high」)對。您是否希望該行反正出現(使用NAs和零採樣數)? – Frank

回答

3

這應該這樣做。

library(tidyverse) 
library(broom) 

test_data %>% 
    mutate(col3 = ifelse(col3 == 0, "NonResistant", "Resistant")) %>% 
    count(col1, col2, col3) %>% 
    spread(col3, n, fill = 0) %>% 
    mutate(PercentResistant = Resistant/(NonResistant + Resistant)) %>% 
    mutate(test = map2(Resistant, NonResistant, ~ binom.test(.x, .x + .y) %>% tidy())) %>% 
    unnest() %>% 
    transmute(Species = col1, Pop.density = col2, PercentResistant, CI_low = conf.low * 100, CI_high = conf.high * 100, TotalSamples = Resistant + NonResistant) 
  1. 變異的0/1阻力列,以便它有可讀價值。
  2. 計算每個存儲桶中的值。
  3. 將col3/n分爲兩列,即Resistant/NonResistant,並將計數(n)放入這些列中。現在每行都有它做測試所需的一切。
  4. 計算百分比阻力
  5. 對每個存儲桶執行測試並將結果放入一個名爲test的嵌套幀中。
  6. Unnest測試數據框,以便您可以使用測試結果。
  7. 清理,給一切美好的名字。

結果

enter image description here

+0

偉大的解決方案!我可以問'conf.low'來自'unnest()'後面我只看到'估計值'和'統計量'嗎? – PoGibas

+1

@PoGibas:conf.low來自'tidy()',然後'unid'。如果你看到估計它應該在那裏。瘋狂的猜測,你的窗口不是那麼寬,在結果下面有「... X多變量」的東西? –

+0

aaa,它是'tbl_df','%>%data.frame()'顯示它。不能習慣'tibble' – PoGibas

4

我會做...

library(data.table) 
setDT(DT) 

DT[, { 
    bt <- binom.test(sum(resists), .N)$conf.int*100 
    .(res_rate = mean(resists)*100, res_lo = bt[1], res_hi = bt[2], n = .N) 
}, keyby=.(species, popdens)] 

    species popdens res_rate res_lo res_hi n 
1:  avi  low 0.00000 0.000000 70.75982 3 
2:  avi  med 0.00000 0.000000 97.50000 1 
3:  bov  low 100.00000 15.811388 100.00000 2 
4:  bov  med 50.00000 1.257912 98.74209 2 
5:  bov high 100.00000 15.811388 100.00000 2 
6:  chi  low 0.00000 0.000000 97.50000 1 
7:  chi  med 50.00000 1.257912 98.74209 2 
8:  chi high 66.66667 9.429932 99.15962 3 
9:  fox  low 0.00000 0.000000 97.50000 1 
10:  fox  med 50.00000 1.257912 98.74209 2 

要包括所有級別(物種和種羣密度的組合)...

DT[CJ(species = species, popdens = popdens, unique = TRUE), on=.(species, popdens), { 
    bt <- 
    if (.N > 0L) binom.test(sum(resists), .N)$conf.int*100 
    else NA_real_ 
    .(res_rate = mean(resists)*100, res_lo = bt[1], res_hi = bt[2], n = .N)  
}, by=.EACHI] 

    species popdens res_rate res_lo res_hi n 
1:  avi  low 0.00000 0.000000 70.75982 3 
2:  avi  med 0.00000 0.000000 97.50000 1 
3:  avi high  NA  NA  NA 0 
4:  bov  low 100.00000 15.811388 100.00000 2 
5:  bov  med 50.00000 1.257912 98.74209 2 
6:  bov high 100.00000 15.811388 100.00000 2 
7:  chi  low 0.00000 0.000000 97.50000 1 
8:  chi  med 50.00000 1.257912 98.74209 2 
9:  chi high 66.66667 9.429932 99.15962 3 
10:  fox  low 0.00000 0.000000 97.50000 1 
11:  fox  med 50.00000 1.257912 98.74209 2 
12:  fox high  NA  NA  NA 0 

它如何w獸人

語法DT[i, j, by=]其中...

  • i確定行的子集,有時用輔助參數,on=roll=
  • by=確定子集表內的組,如果排序則切換到keyby=
  • j是代碼在每個組上的代碼。

j應該評估的名單,.()是一個快捷方式list()。有關詳細信息,請參閱?data.table

數據使用

(改名爲列,重新格式化二元變量回0/1或假/真,集人口密度按照正確的順序水平):

DT = data.frame(
    species = col1, 
    popdens = factor(col2, levels=c("low", "med", "high")), 
    resists = col3 
)