2013-03-16 53 views
2

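Match (and sum) many fields to one in R

I have a data file (.csv) in which each observation is one of 333 districts. Each district has an ID such as 1101, 1102, and so on. I also have a second data file (.csv) in which each observation is one of 112,975 towns and which includes population data. The town data has a district_ID field, and each district contains roughly 300 towns. So there is one district with district_ID == 1101 and about 300 towns with district_ID == 1101.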

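I want to create a district-level population variable in my district data set. That means matching the many town observations to each single district observation and then summing the town-level populations.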

Thanks!

Answers

7

A data.table solution:

#some example data 
set.seed(42) 
districts <- data.frame(district_ID=1:10,whatever=rnorm(10)) 
towns <- data.frame(town=1:100,district_ID=rep(1:10,each=10), 
        population=rpois(100,sample(c(1e3,1e4,1e5)))) 

library(data.table) 
districts <- data.table(districts,key="district_ID") 
towns <- data.table(towns,key="district_ID") 

#calculate district population 
temp <- towns[,list(district_pop=sum(population)),by=district_ID] 
#merge result with districts data.table 
districts <- merge(districts,temp) 

#     district_ID    whatever district_pop
#  1:           1  1.37095845       434886
#  2:           2 -0.56469817       334084
#  3:           3  0.36312841       342241
#  4:           4  0.63286260       433224
#  5:           5  0.40426832       334039
#  6:           6 -0.10612452       342810
#  7:           7  1.51152200       433362
#  8:           8 -0.09465904       333810
#  9:           9  2.01842371       342035
# 10:          10 -0.06271410       432302
+0

+1 All this work and you still haven't received any upvotes? – statquant 2013-03-16 18:47:52

+0

How can I generalize this to sum all the columns in `towns`, rather than only one (population, above), grouped by `district_ID`? – 2013-10-12 10:09:16

+1

`temp <- towns[, lapply(.SD, sum), by = district_ID]`, possibly also using `.SDcols`. Read the documentation. – Roland 2013-10-12 10:43:33
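A minimal sketch of the `.SD` suggestion from the comment above, assuming a small made-up town table (the `households` column and all values are hypothetical). Without `.SDcols`, every non-grouping column is summed, so an identifier column such as `town` would be summed as well unless it is excluded.

```r
library(data.table)

# Hypothetical town data with two numeric columns to aggregate
towns <- data.table(
  district_ID = c(1101, 1101, 1102, 1102, 1102),
  population  = c(100, 200, 10, 20, 30),
  households  = c(40, 60, 5, 5, 10)
)

# Sum every non-grouping column within each district
all_sums <- towns[, lapply(.SD, sum), by = district_ID]

# Restrict the sum to selected columns with .SDcols
pop_only <- towns[, lapply(.SD, sum), by = district_ID, .SDcols = "population"]
```

Either result can then be merged with the district table on `district_ID`, as in the answer above.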

4

Edit: benchmark with a larger data set.

Compute the population of each district with the tapply function:

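# index the tapply result by name, so IDs such as 1101 are not treated as positions
districtdata$population<- 
    as.vector(tapply(towndata$population,towndata$district_ID,sum)[as.character(districtdata$district_ID)]) 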
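A small sketch of the tapply lookup with IDs shaped like those in the question (the data values are made up). `tapply` returns an array whose names are the group labels, so a numeric ID such as 1101 used as an index is read as a position rather than a label; converting the IDs to character looks them up by name instead.

```r
# Hypothetical data: district IDs are labels, not positions 1..n
towndata <- data.frame(
  district_ID = c(1101, 1101, 1102, 1103, 1103),
  population  = c(100, 200, 50, 10, 20)
)
districtdata <- data.frame(district_ID = c(1103, 1101, 1102))

# One sum per district, named "1101", "1102", "1103"
pop_by_district <- tapply(towndata$population, towndata$district_ID, sum)

# Look the sums up by name, in the row order of districtdata
districtdata$population <-
  as.vector(pop_by_district[as.character(districtdata$district_ID)])
```

The lookup by name also works when the district table is not sorted by ID, as in this example.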

Some benchmarks, just for fun:

fn1<-function(districts,towns) #tapply approach 
{ 
    #positional lookup: works here because the IDs are exactly 1..300 
    districts$population<- 
     tapply(towns$population,towns$district_ID,sum)[districts$district_ID] 

    districts 
} 
fn2<-function(districts,towns) #Roland's data.table approach: 
{ 
    districts <- data.table(districts,key="district_ID") 
    towns <- data.table(towns,key="district_ID") 
    temp<-towns[,list(district_pop=sum(population)),by=district_ID] 
    merge(districts,temp) 
} 



library(data.table) 
library(microbenchmark) 

set.seed(42) 
districts <- data.frame(district_ID=1:300,whatever=rnorm(300)) 
#300 districts x 300 towns each = 90,000 towns 
towns <- data.frame(town=1:90000,district_ID=rep(1:300,each=300), 
        population=rpois(90000,sample(c(1e3,1e4,1e5)))) 

microbenchmark(fn1(districts,towns),fn2(districts,towns)) 
Unit: milliseconds 
                  expr       min        lq    median        uq       max neval 
 fn1(districts, towns) 215.29266 231.47103 243.72353 265.28280 355.43895   100 
 fn2(districts, towns)  20.03636  27.51046  36.11116  58.56448  88.70766   100 
+1

You should benchmark with a larger data set. – Roland 2013-03-16 18:53:41

+0

@Roland Yes, I agree, and I have changed the benchmark. I'm a bit surprised that `tapply` is so slow. – 2013-03-16 19:01:18

1

How about:

aggregate(population ~ district_ID, towns, sum) 

(based on Roland's synthetic data)
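The `aggregate` call returns a separate data frame with one row per district. To get the sums into the district table, as the question asks, one option is to follow it with `merge`. The sketch below uses small made-up data; `all.x = TRUE` is an added choice here so that districts without any towns are kept (with `NA` population) rather than dropped.

```r
# Hypothetical district and town tables
districts <- data.frame(
  district_ID = c(1101, 1102, 1103),
  whatever    = c(0.5, -1.2, 0.3)
)
towns <- data.frame(
  town        = 1:5,
  district_ID = c(1101, 1101, 1102, 1103, 1103),
  population  = c(100, 200, 50, 10, 20)
)

# One row per district with the summed town population
district_pop <- aggregate(population ~ district_ID, data = towns, FUN = sum)

# Attach the sums to the district table by the shared key
districts <- merge(districts, district_pop, by = "district_ID", all.x = TRUE)
```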