2015-02-12 38 views
1

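Calculating the maximum difference between rows within a group

I have the following data, in which the residents of each house are sorted by age (oldest to youngest):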

data houses;    
input HouseID PersonID Age;  
datalines;    
1 1 25      
1 2 20     
2 1 32 
2 2 16 
2 3 14 
2 4 12 
3 1 44 
3 2 42 
3 3 10 
3 4 5 
; 
run; 

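I want to calculate, for each household, the largest age difference between consecutive people. For this example that gives 5 (= 25-20), 16 (= 32-16) and 32 (= 42-10) for households 1, 2 and 3 respectively.

I could do this with a long series of merges (extract person 1, merge with the extract of person 2, and so on), but since a household can have 20 or more people, I am looking for a more direct method.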

Answers

6

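This is a two-pass solution. As in the other two solutions, the first step sorts the data by age within each house. The second step tracks max_diff row by row and outputs the result on the last record of each HouseID, so the data is read only twice.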

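proc sort data=houses; by houseid descending age; run; 

data want; 
set houses; 
by houseID; 

retain max_diff 0; 

/* previous age minus current age; non-negative when sorted by descending age */
diff = -dif1(age); 

/* the first row of a house has no predecessor within that house */
if first.HouseID then do; 
    diff = .; max_diff=.; 
end; 

if diff>max_diff then max_diff=diff; 
if last.houseID then output; 

keep houseID max_diff; 
run; 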
+0

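Just a note about the first step: the OP stated that it should be by descending age. It works here because personid appears to be assigned starting with the oldest person, but that may not be the case with real data. – Longfish 2015-02-12 10:30:07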

+0

You're right, I just copied and pasted the initial code. I'll edit the solution, thanks! – Reeza 2015-02-12 14:37:09

2
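proc sort data=houses; by houseid personid age; run; 

data _t1; 
set houses; 
/* previous age minus current age */
diff = dif1(age) * (-1); 
/* person 1 is the first row of each house, so it has no predecessor */
if personid = 1 then diff = .; 
run; 

proc sql; 
create table want as 
select houseid, max(diff) as Max_Diff 
from _t1 
group by houseid; 
quit; 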
+0

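Just be careful with the first step: the OP stated that it should be by descending age. It works here because personid appears to be assigned starting with the oldest person, but that may not be the case with real data. – Longfish 2015-02-12 10:30:30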

+0

Good comment. In this case personid is assigned by descending age within the household, but that may not be true for other users' problems. – user2568648 2015-02-12 10:41:25

+0

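Code-only answers are generally not considered good answers. An answer should explain how and why it works; in fact, that part of the answer is more important than the code. – Joe 2015-02-12 15:37:01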

2
proc sort data = houses; 
by houseid descending age; 
run; 

data houses; 
set houses; 
by houseid; 
/* age from the previous row */
lag_age = lag1(age); 
/* note: this value is overwritten by the next statement */
if first.houseid then age_diff = 0; 
age_diff = lag_age - age; 
run; 

proc sql; 
select houseid,max(age_diff) as max_age_diff 
from houses 
group by houseid; 
quit; 

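How it works:

First, the dataset is sorted by houseid and descending age. Next, the data step computes the difference between the current age value (in the PDV) and the age value from the previous row. Finally, PROC SQL returns the maximum age difference for each houseid.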

+0

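Thanks a lot. With this I just need to assign a dummy value to the oldest person in each house, because their age_diff is calculated by subtracting their age from that of the youngest person in the previous house; in this example, the age_diff for person 1 of house 3 is calculated as -32. This could produce errors: if, for example, the youngest person in house 2 were 80 years old, that age_diff would be 36, so max(age_diff) would be 36 instead of the correct value 32. – user2568648 2015-02-12 10:07:23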
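The adjustment described in this comment could be sketched as follows. This is only an illustration under stated assumptions: it reads the houses dataset from the question, the output name houses_diff is hypothetical, and a missing value is used as the dummy value so that the SQL max() ignores it.

```sas
/* Sketch only: houses_diff is a hypothetical dataset name. */
proc sort data=houses;
    by houseid descending age;
run;

data houses_diff;
    set houses;
    by houseid;
    /* LAG runs on every row so its queue stays in step with the data */
    lag_age = lag1(age);
    age_diff = lag_age - age;
    /* The oldest person has no predecessor within the house, so the
       difference taken across two houses is replaced by a missing value */
    if first.houseid then age_diff = .;
run;

proc sql;
    /* max() skips missing values, so only within-house differences count */
    select houseid, max(age_diff) as max_age_diff
    from houses_diff
    group by houseid;
quit;
```

The reset is placed after the assignment so that it is not overwritten. A house with a single resident would end up with a missing max_age_diff under this sketch.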

2

Just throwing one more into the mix. This is a condensed version of Reeza's answer.

/* No need to sort by PersonID as age is the only concern */ 
proc sort data = houses; 
    by HouseID Age; 
run; 
data want; 
    set houses; 
    by HouseID; 
    /* Keep the diff when a new row is loaded */ 
    retain diff; 
    /* Only replace the diff if it is larger than previous */ 
    diff = max(diff, abs(dif(Age))); 
    /* Reset diff for each new house */ 
    if first.HouseID then diff = 0; 
    /* Only output the final diff for each house */ 
    if last.HouseID; 
    keep HouseID diff; 
run; 
0

Here is an example that uses FIRST. and LAST. and makes a single pass through the data (after sorting).

data houses;    
input HouseID PersonID Age;  
datalines;    
1 1 25      
1 2 20     
2 1 32 
2 2 16 
2 3 14 
2 4 12 
3 1 44 
3 2 42 
3 3 10 
3 4 5 
; 
run; 

Proc sort data=HOUSES; 
by houseid descending age ; 
run; 

Data WANT(keep=houseid max_diff); 
format houseid max_diff; 
retain max_diff age1 age2; 
Set HOUSES; 

by houseid descending age ; 

if first.houseid and last.houseid then do; 
    max_diff=0; 
    output; 
end; 
else if first.houseid then do; 
    call missing(max_diff,age1,age2); 
    age1=age; 
end; 
else if not(first.houseid or last.houseid) then do; 
    age2=age; 
    temp=age1-age2; 
    if temp>max_diff then max_diff=temp; 
    age1=age; 
end; 
else if last.houseid then do; 
    age2=age; 
    temp=age1-age2; 
    if temp>max_diff then max_diff=temp; 
    output; 
end; 
Run;