2016-10-03 57 views
0

Count combinations of two variables, excluding rows with duplicated IDs

I have data on countries and would like to summarise it in a table.

> head(data) 
     country year score members 
       A 1989  0  7 
       A 1990  0  7 
       A 1991  0  7 
       A 1992  0  7 
       A 1993  0  7 
       A 1994  0  7 

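The table should show the relationship between a country's "score" and its number of "members". In other words, I want to see how many countries with a score of 0, 1 or 2 have each number of members (from 1 to 7).

I would like to set it up like this: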

score members==1 members==2 members==3 members==4 members==5 members==6 members==7 
0  1   0 
1  2   0 
2  0   1 and so on.. 

To do this, I ran the following:

library(dplyr) 
    table <- data %>% 
     group_by(score) %>% 
     summarise(
     m1 = sum(members==1, na.rm=TRUE), 
     m2 = sum(members==2, na.rm=TRUE), 
     m3 = sum(members==3, na.rm=TRUE), 
     m4 = sum(members==4, na.rm=TRUE), 
     m5 = sum(members==5, na.rm=TRUE), 
     m6 = sum(members==6, na.rm=TRUE), 
     m7 = sum(members==7, na.rm=TRUE) 

    ) 

This gives:

score m1 m2 m3 m4 m5 m6 m7 
     0  0  2  0  0  0  3 30 
     1 15  3 11 11  3 18  3 
     2  3  0  2  2  0  6  9 
. 
. 

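This is where I need some help. As you can see, it has counted the total number of observations, whereas I want each country to be counted only once.

How can I summarise this data to get the total number of countries at each level of members?

Here is a reproducible sample of my data: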

data <- 
structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L), .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"), 
    year = c(1989L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 
    1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
    2005L, 2006L, 2007L, 2008L, 2010L, 1989L, 1990L, 1991L, 1992L, 
    1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 
    2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 
    2011L, 1989L, 1991L, 1993L, 1994L, 1995L, 1996L, 1997L, 1999L, 
    2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
    2010L, 1989L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 
    1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
    2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1991L, 1992L, 1993L, 
    1994L, 1995L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 
    2004L, 2005L, 2006L, 2007L, 2008L, 2010L, 1991L, 1992L, 1993L, 
    1994L, 1995L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 
    2004L, 2005L, 2006L, 2007L, 2008L, 2010L), score = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 
    1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 2L, 2L, 
    2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 
    2L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L 
    ), members = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
    7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 6L, 6L, 6L, 6L, 6L, 
    6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
    7L, 7L, 7L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 
    7L, 7L, 7L, 7L, 7L, 7L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
    4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
    4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L)), .Names = c("country", "year", "score", 
"members"), class = "data.frame", row.names = c(NA, -121L)) 
+3

'with(data, table(score, members))' – Frank

+1

or 'with(data, table(score, members, country))' if it has to be per country – Cath

+0

What is your desired output? – Cath

Answers

3

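Since the OP is using dplyr, we can do this by grouping by 'score' and 'members', getting the number of elements in each group (n()), and then using spread (from tidyr) to reshape the result into 'wide' format.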

library(dplyr) 
library(tidyr) 
data %>% 
    group_by(score, members) %>% 
    summarise(n = n()) %>% 
    mutate(members = paste0("m", members)) %>% 
    spread(members, n, fill = 0) 
# score m1 m2 m3 m4 m5 m6 m7 
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 
#1  0  0  2  0  0  0  3 30 
#2  1 15  3 11 11  3 18  3 
#3  2  3  0  2  2  0  6  9 

If we also need the counts by 'country', just add 'country' to group_by:

data %>% 
    group_by(country, score, members) %>% 
    summarise(n = n()) %>% 
    mutate(members = paste0("m", members)) %>% 
    spread(members, n, fill = 0) 

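If the expected output is the one shown in the other answer, an option using data.table is to convert the 'data.frame' to a 'data.table' (setDT(data)) and dcast it from 'long' to 'wide', specifying 'value.var' as 'country' and fun.aggregate as uniqueN, where uniqueN returns the length of the unique elements in the 'country' column. fill = 0 is specified so that combinations that are not present get 0; by default they are returned as NA.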

library(data.table) 
dcast(setDT(data), score~members, value.var= 'country', fun.aggregate = uniqueN, fill = 0) 
# score 1 2 3 4 5 6 7 
#1:  0 0 1 0 0 0 1 2 
#2:  1 1 1 2 2 1 3 2 
#3:  2 1 0 1 2 0 1 1 
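
To make the difference between the two outputs concrete, here is a minimal base R sketch on a made-up toy data frame (the name toy and its values are assumptions for illustration, not the OP's data). It contrasts counting rows with counting distinct countries per (score, members) cell, which is what the uniqueN aggregation above is doing.

```r
# Hypothetical toy data: country A has three yearly rows, B has two, C has one
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C"),
  score   = c(0, 0, 0, 1, 1, 1),
  members = c(7, 7, 7, 6, 6, 6)
)

# Row counts: every yearly observation adds to its cell
rows <- with(toy, table(score, members))

# Distinct-country counts: each country adds at most 1 to a given cell
distinct <- with(toy, tapply(
  country,
  list(score = score, members = members),
  function(x) length(unique(x))
))

# tapply leaves empty combinations as NA; turn them into 0
distinct[is.na(distinct)] <- 0

print(rows)
print(distinct)
```

In this toy example the (score 0, members 7) cell holds 3 in the row-count table but 1 in the distinct-country table, because all three rows belong to country A.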
+0

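That is the output the OP already gets but does not want...: *As you can see, it has counted the total number of observations, whereas I want each country to be counted only once* – Cath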

+0

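Yes (and I really see 2 packages and many lines to get the same thing, btw), but still, going by the OP's comment, I don't think this is what they want – Cath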

+2

'group_by(score, members) %>% summarize(n = n())' can be written as 'count(score, members)' – Axeman

4

I believe you need this:

library(reshape2) 
dcast(aggregate(country~score+members, data=data, FUN=function(x) length(unique(x))), 
     score~members, value.var="country", fill=0L) 
# score 1 2 3 4 5 6 7 
#1  0 0 1 0 0 0 1 2 
#2  1 1 1 2 2 1 3 2 
#3  2 1 0 1 2 0 1 1 

Or, the dplyr/tidyr way:

data %>% 
    group_by(members, score) %>% 
    summarise(n=n_distinct(country)) %>% 
    spread(members, n, fill=0L) 

## A tibble: 3 x 8 
# score  1  2  3  4  5  6  7 
#* <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 
#1  0  0  1  0  0  0  1  2 
#2  1  1  1  2  2  1  3  2 
#3  2  1  0  1  2  0  1  1 
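
In current tidyr, spread has been superseded by pivot_wider. The sketch below shows roughly the same distinct-country count written with pivot_wider, on a made-up toy data frame (toy and its values are assumptions, not the OP's data). It assumes tidyr 1.1.0 or later, where values_fill accepts a scalar and names_sort is available.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: country A has three yearly rows, B has two, C has one
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C"),
  score   = c(0, 0, 0, 1, 1, 1),
  members = c(7, 7, 7, 6, 6, 6)
)

wide <- toy %>%
  group_by(score, members) %>%
  summarise(n = n_distinct(country)) %>%   # each country counted once per cell
  ungroup() %>%
  pivot_wider(
    names_from   = members,
    values_from  = n,
    names_prefix = "m",    # column names m6, m7 instead of 6, 7
    names_sort   = TRUE,   # order the new columns by name
    values_fill  = 0       # combinations that never occur get 0, not NA
  )

print(wide)
```

The names_prefix argument gives the m1 ... m7 style column names from the question without the extra mutate step.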
+0

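@akrun Come on, we just didn't understand the Q in the same way; I hinted at what I thought the OP wanted. We didn't get the same output, except that now, following my answer, you have the option of getting the other output... – Cath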

+1

I think it is better that you post 'n_distinct(country)', as it is the dplyr way. I will delete mine. I wanted to post it bcz of your comments under my answer – akrun

2

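It seems the crux of the problem is that there are duplicate rows for each year? In that case you can use distinct to remove them, and then it is a simple cross-tabulation. You can use the %$% exposition pipe from magrittr: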

library(dplyr) 
library(magrittr) 
data %>% 
    distinct(country, score, members) %$% 
    table(score, members) 

    members 
score 1 2 3 4 5 6 7 
    0 0 1 0 0 0 1 2 
    1 1 1 2 2 1 3 2 
    2 1 0 1 2 0 1 1 

Or, with the regular pipe, crosstab from the janitor package:

library(dplyr) 
library(janitor) 
data %>% 
    distinct(country, score, members) %>% 
    crosstab(score, members) 

    score 1 2 3 4 5 6 7 
1  0 0 1 0 0 0 1 2 
2  1 1 1 2 2 1 3 2 
3  2 1 0 1 2 0 1 1 
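
The same deduplicate-then-tabulate idea can also be sketched in base R alone, with unique standing in for distinct. The toy data frame below is made up for illustration (its name and values are assumptions, not the OP's data).

```r
# Hypothetical toy data: country A has three yearly rows, B has two, C has one
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C"),
  score   = c(0, 0, 0, 1, 1, 1),
  members = c(7, 7, 7, 6, 6, 6)
)

# Keep one row per (country, score, members) combination, dropping the year repeats
dedup <- unique(toy[c("country", "score", "members")])

# Cross-tabulate the deduplicated rows: each cell is now a number of countries
tab <- with(dedup, table(score, members))

print(tab)
```

Because the year column is dropped before unique is applied, a country that keeps the same score and members across many years contributes a single row to the table.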