Matching (and summing) many-to-one fields in R

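I have a data file (.csv) in which each observation is one of 333 districts. Each district has an ID such as 1101, 1102, ... I also have a second data file (.csv) in which each observation is one of 112,975 towns, including population data. The town data has a district_ID field. Each district contains roughly 300 towns, so there is one district with district_ID == 1101 and about 300 towns with district_ID == 1101.

I want to create a district-level population variable in my district data set. This means matching the multiple town observations to each single district observation, and then summing the town-level population.

Thanks!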


2013-03-16 Dr. Beeblebrox

A data.table solution:

#some example data 
set.seed(42) 
districts <- data.frame(district_ID=1:10,whatever=rnorm(10)) 
towns <- data.frame(town=1:100,district_ID=rep(1:10,each=10), 
        population=rpois(100,sample(c(1e3,1e4,1e5)))) 

library(data.table) 
districts <- data.table(districts,key="district_ID") 
towns <- data.table(towns,key="district_ID") 

#calculate district population 
temp <- towns[,list(district_pop=sum(population)),by=district_ID] 
#merge result with districts data.table 
districts <- merge(districts,temp) 

#     district_ID    whatever district_pop
#  1:           1  1.37095845       434886
#  2:           2 -0.56469817       334084
#  3:           3  0.36312841       342241
#  4:           4  0.63286260       433224
#  5:           5  0.40426832       334039
#  6:           6 -0.10612452       342810
#  7:           7  1.51152200       433362
#  8:           8 -0.09465904       333810
#  9:           9  2.01842371       342035
# 10:          10 -0.06271410       432302
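As a possible alternative to the final `merge` step, recent data.table versions also support an update join that adds the column to `districts` by reference. The sketch below uses small hypothetical data (the IDs and values are made up for illustration) and assumes a data.table version that supports the `on=` argument.

```r
library(data.table)

# Hypothetical data, shaped like the district and town files in the question
districts <- data.table(district_ID = c(1101, 1102), whatever = c(0.5, -1.2))
towns <- data.table(
  district_ID = c(1101, 1101, 1102, 1102, 1102),
  population  = c(100, 200, 10, 20, 30)
)

# Sum the town populations within each district
temp <- towns[, .(district_pop = sum(population)), by = district_ID]

# Update join: look up each district in temp and add district_pop in place
districts[temp, district_pop := i.district_pop, on = "district_ID"]
print(districts)
```

Districts with no matching towns would simply keep `NA` in `district_pop`, whereas the default `merge` drops them.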


2013-03-16 18:04:18 Roland

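+1 All this work and you still haven't received anything? – statquant 2013-03-16 18:47:52

How could I generalize this to sum all the columns in `towns`, rather than just one (population, above), indexed by `district_ID`? – 2013-10-12 10:09:16

`temp <- towns[, lapply(.SD, sum), by = district_ID]`, possibly also using `.SDcols`. Read the documentation. – Roland 2013-10-12 10:43:33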
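A minimal sketch of that suggestion, assuming a hypothetical `households` column alongside `population` (neither the column nor the values come from the original data):

```r
library(data.table)

# Hypothetical town-level data with two numeric columns to sum
towns <- data.table(
  town        = 1:6,
  district_ID = c(1101, 1101, 1101, 1102, 1102, 1102),
  population  = c(100, 200, 300, 10, 20, 30),
  households  = c(40, 80, 120, 4, 8, 12)
)

# Sum only the columns named in .SDcols, giving one row per district
temp <- towns[, lapply(.SD, sum), by = district_ID,
              .SDcols = c("population", "households")]
print(temp)
```

Without `.SDcols`, every non-grouping column would be summed, including the `town` identifier, which is rarely what is wanted.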

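Edit: benchmark with a larger data set.

Compute each district's population with the tapply function:

#index by name: IDs such as 1101 are labels, not positions
districtdata$population<- 
    tapply(towndata$population,towndata$district_ID,sum)[as.character(districtdata$district_ID)]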
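One caveat when indexing a `tapply` result: a numeric index is treated as a position, so with IDs like 1101 a lookup by name is the safer choice. A small sketch with hypothetical values, reusing the `districtdata` and `towndata` names from above:

```r
# Hypothetical data using IDs like those in the question
districtdata <- data.frame(district_ID = c(1102, 1101))
towndata <- data.frame(
  district_ID = c(1101, 1101, 1102, 1102, 1102),
  population  = c(100, 200, 10, 20, 30)
)

# tapply returns a vector whose names are the district IDs
pop_by_district <- tapply(towndata$population, towndata$district_ID, sum)

# Look up by name (character) so the row order of districtdata does not matter
districtdata$population <-
  as.vector(pop_by_district[as.character(districtdata$district_ID)])
print(districtdata)
```

Indexing with the raw numeric IDs instead would ask for elements 1102 and 1101 of a length-2 vector and return `NA` for both.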

Some benchmarks, just for fun:

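fn1<-function(districts,towns) #tapply approach: 
{ 
    #positional indexing is safe here only because district_ID runs 1..n; 
    #for IDs like 1101, index with as.character(districts$district_ID) 
    districts$population<- 
     tapply(towns$population,towns$district_ID,sum)[districts$district_ID] 

    districts 
} 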
fn2<-function(districts,towns) #Roland's data.table approach: 
{ 
    districts <- data.table(districts,key="district_ID") 
    towns <- data.table(towns,key="district_ID") 
    temp<-towns[,list(district_pop=sum(population)),by=district_ID] 
    merge(districts,temp) 
} 



library(data.table) 
library(microbenchmark) 

set.seed(42) 
districts <- data.frame(district_ID=1:300,whatever=rnorm(300)) 
#300 districts x 300 towns each = 90,000 rows; all columns must match in length 
towns <- data.frame(town=1:90000,district_ID=rep(1:300,each=300), 
        population=rpois(90000,sample(c(1e3,1e4,1e5)))) 

microbenchmark(fn1(districts,towns),fn2(districts,towns)) 
Unit: milliseconds
                  expr       min        lq    median        uq       max neval
 fn1(districts, towns) 215.29266 231.47103 243.72353 265.28280 355.43895   100
 fn2(districts, towns)  20.03636  27.51046  36.11116  58.56448  88.70766   100


2013-03-16 17:56:45

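You should benchmark with a larger data set. – Roland 2013-03-16 18:53:41

@Roland Yes, I agree, and I have changed the benchmark. I'm a bit surprised that `tapply` is so slow. – 2013-03-16 19:01:18

How about: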

aggregate(population ~ district_ID, towns, sum)

(Based on Roland's synthetic data)
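The `aggregate` call returns one row per district, so attaching the totals to the district data still needs a join. A minimal sketch with hypothetical values (district 1103 is deliberately given no towns):

```r
# Hypothetical small data sets in the shape described in the question
districts <- data.frame(district_ID = c(1101, 1102, 1103),
                        whatever    = c("a", "b", "c"))
towns <- data.frame(
  district_ID = c(1101, 1101, 1102, 1102, 1102),
  population  = c(100, 200, 10, 20, 30)
)

# One row per district with the summed town populations
district_pop <- aggregate(population ~ district_ID, data = towns, FUN = sum)

# all.x = TRUE keeps districts without any towns (population becomes NA)
districts <- merge(districts, district_pop, by = "district_ID", all.x = TRUE)
print(districts)
```

Leaving out `all.x = TRUE` would drop any district that has no towns in the town file, which may or may not be the desired behaviour.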


2013-03-16 19:20:56 texb
