2016-02-13 65 views

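Reweighting indices by aggregating and summing in R

I have a ton of price data indexed by state, date, and UPC (product code). I want to aggregate over the UPCs and combine the prices with a weighted average. I'll try to explain it, but you may just want to read the code below.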

Each observation in the dataset consists of: UPC, date, state, price, and weight. I want to aggregate away the UPC index like this:

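Take all the data points that share the same date and state, multiply each price by its weight, and sum the results. This gives a weighted sum, which I call priceIndex. However, for some date & state combinations the weights do not add up to 1. So I want to create two additional columns. The first is the sum of the weights for each date & state combination. The second is a reweighted average: that is, if the two original weights were .5 and .3, change them to .5/(.5+.3) = .625 and .3/(.5+.3) = .375, and then recompute the weighted average as another price index.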
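A minimal sketch of that rescaling step in base R, using the two weights from the paragraph above. The prices in `p` are hypothetical values chosen only for illustration; they are not taken from the dataset.

```r
# Two weights that sum to 0.8 rather than 1 (the values from the text above)
w <- c(0.5, 0.3)
# Hypothetical prices, used only to illustrate the arithmetic
p <- c(10, 20)

# Rescale the weights so that they sum to 1
w_rescaled <- w / sum(w)

# Raw weighted sum (weights left as they are)
priceIndex <- sum(p * w)

# Reweighted average, computed two equivalent ways
reweighted_a <- sum(p * w_rescaled)
reweighted_b <- priceIndex / sum(w)

# Base R's weighted.mean() performs the same normalisation internally
reweighted_c <- weighted.mean(p, w)
```

All three reweighted values agree, so dividing the raw weighted sum by the total weight is enough; the rescaled weights never need to be stored.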

This is what I mean:

upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001) 
date=c(200601,200602,200603,200603,200601,200602,200601,200602,200603,200601,200602,200601,200602,200603,200601) 
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47) 
state=c(1,1,1,2,2,2,1,1,2,2,2,1,1,1,2) 
weight=c(.3,.2,.6,.4,.4,.5,.5,.5,.45,.15,.5,.2,.15,.3,.45) 

# This is what I have: 
data <- data.frame(upc,date,state,price,weight) 
data 

# These are a few of the weighted calculations: 
# .3*26+85*.5+24*.2 = 55.1 
# 28*.2+84*.5+19*.15 = 50.45 
# 27*.6+98*.3 = 45.6 
# Etc. etc. 

# Here is the reweighted calculation for date==200602 & state==1: 
# 28*(.2/.85)+84*(.5/.85)+19*(.15/.85) = 59.35294 
# Or, equivalently: 
# (28*.2+84*.5+19*.15)/.85 = 59.35294 

# This is what I want: 
date=c(200601,200602,200603,200601,200602,200603) 
state=c(1,1,1,2,2,2) 
priceIndex=c(55.1,50.45,45.6,42.5,51,46.575) 
totalWeight=c(1,.85,.9,1,1,.85) 
reweightedIndex=c(55.1,59.35294,50.66667,42.5,51,54.79412) 
index <- data.frame(date,state,priceIndex,totalWeight,reweightedIndex) 
index 

Also, not that it should matter, but there are 35 states, 150 UPCs, and 84 dates in the dataset, so there are a lot of observations.

Many thanks.

Answer


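We can use one of the group-by summarise operations. With data.table, we convert the 'data.frame' to a 'data.table' with setDT(data). Grouped by 'date' and 'state', we compute the sum of the product of 'price' and 'weight', and sum(weight), as temporary variables, and then build the 3 output variables in a list from those.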

library(data.table) 
setDT(data)[, {tmp1 = sum(price*weight) 
       tmp2 = sum(weight) 
     list(priceIndex=tmp1, totalWeight=tmp2, 
       reweightedIndex = tmp1/tmp2)}, .(date, state)] 
# date state priceIndex totalWeight reweightedIndex 
#1: 200601  1  55.100  1.00  55.10000 
#2: 200602  1  50.450  0.85  59.35294 
#3: 200603  1  45.600  0.90  50.66667 
#4: 200603  2  46.575  0.85  54.79412 
#5: 200601  2  42.500  1.00  42.50000 
#6: 200602  2  51.000  1.00  51.00000 

Or, using dplyr, we can group by 'date' and 'state' and then use summarise to create the 3 columns.

library(dplyr) 
data %>% 
    group_by(date, state) %>% 
    summarise(priceIndex = sum(price*weight), 
      totalWeight = sum(weight), 
      reweightedIndex = priceIndex/totalWeight) 
# date state priceIndex totalWeight reweightedIndex 
# (dbl) (dbl)  (dbl)  (dbl)   (dbl) 
#1 200601  1  55.100  1.00  55.10000 
#2 200601  2  42.500  1.00  42.50000 
#3 200602  1  50.450  0.85  59.35294 
#4 200602  2  51.000  1.00  51.00000 
#5 200603  1  45.600  0.90  50.66667 
#6 200603  2  46.575  0.85  54.79412 
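As a cross-check that needs no extra packages, the same three columns can be sketched in base R with `aggregate`. This is an illustrative alternative rather than one of the two approaches above; the helper column `pw` is a name introduced here only for the example.

```r
# Rebuild the example data from the question
data <- data.frame(
  upc    = rep(c(1153801013, 2105900750, 2173300001), times = c(6, 5, 4)),
  date   = c(200601, 200602, 200603, 200603, 200601, 200602, 200601, 200602,
             200603, 200601, 200602, 200601, 200602, 200603, 200601),
  state  = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2),
  price  = c(26, 28, 27, 27, 23, 24, 85, 84, 79.5, 81, 78, 24, 19, 98, 47),
  weight = c(.3, .2, .6, .4, .4, .5, .5, .5, .45, .15, .5, .2, .15, .3, .45)
)

# Per-row contribution to the weighted sum (hypothetical helper column)
data$pw <- data$price * data$weight

# Sum the contributions and the weights within each date/state group
index <- aggregate(data[c("pw", "weight")],
                   by = data[c("date", "state")],
                   FUN = sum)
names(index)[names(index) == "pw"] <- "priceIndex"
names(index)[names(index) == "weight"] <- "totalWeight"

# Dividing by the total weight rescales the weights to sum to 1
index$reweightedIndex <- index$priceIndex / index$totalWeight
index
```

With 35 states and 84 dates this yields at most 2,940 rows, one per date/state combination that actually occurs in the data.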

For the dplyr one, I only get a single row when I run it? – ejn


@ejn You can use `dplyr::summarise` (if you have also loaded `plyr`). – akrun
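To illustrate the comment above: when plyr is attached after dplyr, its `summarise` masks dplyr's version and does not respect the grouping, which is the usual cause of a one-row result. A minimal sketch of the namespaced call, assuming the dplyr package is installed; the small data frame is a hypothetical subset of the question's state 1 rows.

```r
# Hypothetical subset of the question's data (state 1, two dates)
data <- data.frame(
  date   = c(200602, 200602, 200602, 200603, 200603),
  state  = c(1, 1, 1, 1, 1),
  price  = c(28, 84, 19, 27, 98),
  weight = c(0.2, 0.5, 0.15, 0.6, 0.3)
)

# Explicit dplyr:: prefixes make the calls independent of which
# packages happen to be attached, and in what order
grouped <- dplyr::group_by(data, date, state)
index <- dplyr::summarise(
  grouped,
  priceIndex      = sum(price * weight),
  totalWeight     = sum(weight),
  reweightedIndex = priceIndex / totalWeight
)
index <- as.data.frame(index)
index
```

The result has one row per date/state group (two rows here) instead of a single collapsed row.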