2017-03-31 86 views

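Aggregate in R with multiple conditions

I want to summarize a dataset in R. I am a beginner in R. The code below works, but it takes a lot of steps. Is there a simpler way to achieve this? I want to accomplish the following: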

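1) Group by Client_ID
2) Count all ClaimNumbers (whether associated with DS or not)
3) Count only the ClaimNumbers with DS
4) Sum Retail and WS for DS only
5) Also, I want to count each claim only once. In the data, a claim number is repeated for each service year and service.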

# example (the sqldf() calls below need the sqldf package)
library(sqldf)

ds <- read.table(text = " 
Client_ID ClaimNumber ServiceYr Service Retail WS 
A00002   WC1  2012  DS 100 25 
A00002   WC1  2013  DS 100 25 
A00002   WC1  2014  BR  50 10 
A00002   WC2  2014  BR  50 10 
A00002   WC3  2014  BR  50 10 
A00003   WC4  2014  BR  50 10 
A00003   WC4  2015  BR  50 10 
A00003   WC5  2015  BR  50 10 
A00003   WC5  2016  BR  50 10 
A00003   WC6  2016  DS 100 25", 
       sep="",header=TRUE) 

# group by client ID and claim number to get one row per claim number 
total_claims <- sqldf("select Client_ID,ClaimNumber from ds group 
        by Client_ID,ClaimNumber") 

# For DS claims only - group by client ID and claim number 
# to get one row per claim number 
ds_claims <- sqldf("select Client_ID,ClaimNumber, sum(Retail) as Retail, 
    sum(WS) as WS from ds where Service='DS' group by Client_ID,ClaimNumber") 

# count the total number of claims by client 
total_counts <- aggregate(total_claims[,2], by=list(total_claims$Client_ID), FUN=length)

# fix column headers 
colnames(total_counts)[1:2] <- c("Client_ID","ClaimCount") 

# count the number of DS claims by client 
ds_claim_counts <- aggregate(ds_claims[,2], by=list(ds_claims$Client_ID), FUN=length)

# fix column headers 
colnames(ds_claim_counts)[1:2] <- c("Client_ID","ClaimCount") 

# merge to get both total counts and ds counts on the same table 
total <- merge(total_counts,ds_claim_counts, by="Client_ID",all.x=TRUE) 

# merge to add ds retail and ws amounts to total table 
total <- merge(total,ds_claims[,c(1,3,4)], by="Client_ID",all.x=TRUE) 

# fix column headers 
colnames(total)[2:3] <- c("Total_CC","DS_CC") 

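Please take a look at these tips on how to produce a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve), as well as this post on [creating a great example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – lmo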

Answers


Here are some alternatives that give the same answer as the code in the question:

1) sqldf

library(sqldf) 

sqldf("select Client_ID, 
       count(distinct ClaimNumber) Total_CC, 
       count(distinct case when Service = 'DS' 
           then ClaimNumber 
           else NULL 
          end) DS_CC, 
       sum(Retail * (Service = 'DS')) Retail, 
       sum(WS * (Service = 'DS')) WS 
     from ds 
     group by Client_ID") 

giving:

  Client_ID Total_CC DS_CC Retail WS
1    A00002        3     1    200 50
2    A00003        3     1    100 25

2) data.table

library(data.table) 

DT <- as.data.table(ds) 
DT[, list(Total_CC = length(unique(ClaimNumber)), 
      DS_CC = length(unique(ClaimNumber[Service == "DS"])), 
      Retail = sum(Retail * (Service == "DS")), 
      WS = sum(WS * (Service == "DS"))), by = Client_ID] 

giving:

   Client_ID Total_CC DS_CC Retail WS
1:    A00002        3     1    200 50
2:    A00003        3     1    100 25
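The `sum(Retail * (Service == "DS"))` idiom used in these alternatives relies on R coercing `TRUE`/`FALSE` to 1/0, so rows that are not DS contribute zero to the sum. As a minimal sketch of that idea (the toy vectors and the names `is_ds`, `masked_sum`, and `subset_sum` are illustrative assumptions, not part of the original answer), it can be compared with subsetting first:

```r
# Toy vectors (hypothetical values mirroring client A00002 in the question)
Service <- c("DS", "DS", "BR", "BR", "BR")
Retail  <- c(100, 100, 50, 50, 50)

# Logical mask marking the DS rows
is_ds <- Service == "DS"

# Multiplying by the mask coerces TRUE/FALSE to 1/0,
# so non-DS rows add 0 to the total
masked_sum <- sum(Retail * is_ds)

# Alternative: keep only the DS rows, then sum
subset_sum <- sum(Retail[is_ds])

print(masked_sum)
print(subset_sum)
```

The two forms agree on this data. They can differ when a non-DS row has a missing amount, because `NA * 0` is `NA` in R, whereas subsetting drops that row before summing.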

3) dplyr

library(dplyr) 

ds %>% 
    group_by(Client_ID) %>% 
    summarize(Total_CC = length(unique(ClaimNumber)), 
      DS_CC = length(unique(ClaimNumber[Service == "DS"])), 
      Retail = sum(Retail * (Service == "DS")), 
      WS = sum(WS * (Service == "DS"))) %>% 
    ungroup 

giving:

# A tibble: 2 × 5
  Client_ID Total_CC DS_CC Retail    WS
     <fctr>    <int> <int>  <int> <int>
1    A00002        3     1    200    50
2    A00003        3     1    100    25
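For readers without any of these packages, the same summary can probably also be written in base R. The following is only a sketch and not part of the original answer: the helper name `summarise_client` is hypothetical, and it subsets the DS rows before summing rather than multiplying by the logical mask.

```r
# Sample data from the question
ds <- read.table(text = "
Client_ID ClaimNumber ServiceYr Service Retail WS
A00002 WC1 2012 DS 100 25
A00002 WC1 2013 DS 100 25
A00002 WC1 2014 BR 50 10
A00002 WC2 2014 BR 50 10
A00002 WC3 2014 BR 50 10
A00003 WC4 2014 BR 50 10
A00003 WC4 2015 BR 50 10
A00003 WC5 2015 BR 50 10
A00003 WC5 2016 BR 50 10
A00003 WC6 2016 DS 100 25", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: summarise the rows belonging to one client
summarise_client <- function(d) {
  is_ds <- d$Service == "DS"
  data.frame(
    Client_ID = d$Client_ID[1],
    Total_CC  = length(unique(d$ClaimNumber)),        # each claim counted once
    DS_CC     = length(unique(d$ClaimNumber[is_ds])), # DS claims counted once
    Retail    = sum(d$Retail[is_ds]),                 # DS rows only
    WS        = sum(d$WS[is_ds])                      # DS rows only
  )
}

# Split by client, summarise each piece, and stack the results
result <- do.call(rbind, lapply(split(ds, ds$Client_ID), summarise_client))
rownames(result) <- NULL
print(result)
```

On the sample data this should reproduce the counts and sums shown above. Like the other alternatives, it adds Retail and WS over every DS row, so a claim that appears in several service years contributes once per year to the sums but only once to the counts.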

Thank you! This is very helpful! I appreciate it. – user3670204