2016-02-25
1

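Here is my code. It uses two GROUP ... ALL operations and it works. My goal is to compute the unique user count and the total scores for all students, plus the unique user count of students located in CA. Is there a good way to simplify the code so that it uses only one GROUP operation, or any other constructive idea to simplify it, for example using only one FOREACH? Thanks. I am looking for advice on making the Pig code below simpler.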

student_all = group student all; 
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2; 

student_CA = filter student by LID==1; 
student_CA_all = group student_CA all; 
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA); 

Sample input (student ID, location ID, mathScore, verbScore):

1 1 10 20 
2 1 20 30 
3 1 30 40 
4 2 30 50 
5 2 30 50 
6 3 30 50 

Sample output (unique users, unique users in CA, sum of mathScore over all students, sum of verbScore over all students):

6 3 150 240 

Thanks in advance, Lin

+1

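I don't really know Pig, so I can't give you an exact answer, but conceptually I think you want to 'GROUP student BY (student, LID)' to roll the data up to a more manageable size while still keeping the granularity you need; your 'FOREACH' aggregations will then be much faster. – maxymoo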

+0

@maxymoo, do you mean grouping by location (LID) only, or by location plus student ID?

+1

Both combined, so you end up with a table of size #locations * #students. You can then either filter by location or sum everything up to get both kinds of aggregate. – maxymoo
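The roll-up idea from the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the field name SID and the LOAD line are hypothetical (the question does not show how student is loaded; the path and delimiter are borrowed from the answers), and it groups by LID alone, since student IDs in the sample are unique and adding them to the key would not shrink the data.

```pig
-- Hypothetical load; SID is an assumed field name.
student = LOAD '/tmp/temp.csv' USING PigStorage(' ')
          AS (SID:int, LID:int, mathScore:int, verbScore:int);

-- Stage 1: one summary row per location.
by_loc = GROUP student BY LID;
loc_summary = FOREACH by_loc GENERATE
    group                  AS LID,
    COUNT_STAR(student)    AS n,
    SUM(student.mathScore) AS math_sum,
    SUM(student.verbScore) AS verb_sum;

-- Stage 2: combine the per-location rows into overall totals,
-- picking out location 1 (CA) for the CA-only count.
all_loc = GROUP loc_summary ALL;
totals = FOREACH all_loc {
    ca = FILTER loc_summary BY LID == 1;
    GENERATE SUM(loc_summary.n)        AS uu_count,
             SUM(ca.n)                 AS ca_count,
             SUM(loc_summary.math_sum) AS math_total,
             SUM(loc_summary.verb_sum) AS verb_total;
};
```

This still uses two GROUP statements, so it does not shorten the script; the possible benefit is that the second stage only processes one row per location.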

Answers

1

You might be looking for this.

data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int); 

gdata = group data all; 

result = foreach gdata { 
     student_CA = filter data by lid == 1; 
     student_CA_sum = SUM(student_CA.sid) ; 
     student_CA_count = COUNT(student_CA.sid) ; 
     mathScore = SUM(data.ms); 
     verbScore = SUM(data.vs); 
     GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore; 
}; 

The output is:

grunt> dump result 
    (6,3,150,240) 
grunt> describe result 
    result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long} 
+0

Thanks Mahendra. Why did you use COUNT(student_CA.sid) rather than COUNT(student_CA)?

+0

Hi Mahendra, thanks for your help; I have marked your reply as the accepted answer. Have a nice weekend. :)

+1

@LinMa 'COUNT(student_CA.sid)' and 'COUNT(student_CA)' give the same result; I used the former by mistake while copy-pasting. – Mahendra
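For readers comparing the counting functions, here is a small sketch that reuses the load statement from the answer. The alias names in the GENERATE clause are made up for illustration. As far as the Pig documentation describes it, COUNT skips a tuple whose first field is null, while COUNT_STAR counts every tuple, so the three values can only differ when sid is null.

```pig
data = LOAD '/tmp/temp.csv' USING PigStorage(' ')
       AS (sid:int, lid:int, ms:int, vs:int);

gdata = GROUP data ALL;

counts = FOREACH gdata {
    student_CA = FILTER data BY lid == 1;
    -- COUNT looks at the first field of each tuple (sid here) and skips nulls;
    -- COUNT_STAR counts all tuples regardless of nulls.
    GENERATE COUNT(student_CA)      AS ca_first_field_not_null,
             COUNT(student_CA.sid)  AS ca_sid_not_null,
             COUNT_STAR(student_CA) AS ca_all_rows;
};
```

On the sample input, where no sid is missing, all three columns should agree.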

1

First load the file (student) from the Hadoop file system, then perform the steps below.

split student into student_CA if locationId == 1, student_Other if locationId != 1; 

student_CA_all = group student_CA all; 

student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount; 

student_Other_all = group student_Other all; 

student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count,0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount,SUM(student_Other.verbScore) as vScoreCount; 

student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary; 

student_summary_all = group student_CAandOther_all_summary all; 

student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount; 

Output:

dump student_summary; 
(6,3,150,240) 

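Hope this helps :)

While solving your problem I also ran into an issue with Pig. I think it is caused by improper exception handling in the UNION command: running that command can hang the command-line prompt without a proper error message. If you want, I can share the snippet with you.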

+0

Thanks hello_abhishek, nice code, upvoted!

1

The accepted answer has a logic error.

Try it with the following input file:

1 1 10 20 
2 1 20 30 
3 1 30 40 
4 2 30 50 
5 2 30 50 
6 3 30 50 
7 1 10 10 

The output will be

(13,4,160,250) 

The output should be

(7,4,160,250) 

I modified the script so that it works correctly.

data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int); 

gdata = group data all; 

result = foreach gdata { 
    -- total number of students (despite the alias name, this is a count, not a sum)
    student_CA_sum = COUNT(data.sid) ; 
    -- students located in CA (lid == 1)
    student_CA = filter data by lid == 1; 
    student_CA_count = COUNT(student_CA.sid) ; 
    -- score totals over all students
    mathScore = SUM(data.ms); 
    verbScore = SUM(data.vs); 
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore; 
};

Output

(7,4,160,250) 
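As a possible variation on the script above (a sketch, not something proposed in the thread), the CA count can be computed without a nested FILTER by first projecting a 0/1 flag and then summing it. The aliases flagged, is_ca and summary are invented for this example.

```pig
data = LOAD '/tmp/temp.csv' USING PigStorage(' ')
       AS (sid:int, lid:int, ms:int, vs:int);

-- Mark each row with 1 if the student is in CA (lid == 1), else 0.
flagged = FOREACH data GENERATE
    (lid == 1 ? 1L : 0L) AS is_ca,
    ms,
    vs;

gdata = GROUP flagged ALL;

-- One GROUP and one plain FOREACH: every output column is a simple aggregate.
summary = FOREACH gdata GENERATE
    COUNT_STAR(flagged) AS student_count,
    SUM(flagged.is_ca)  AS student_CA_count,
    SUM(flagged.ms)     AS mathScore,
    SUM(flagged.vs)     AS verbScore;
```

Because the final FOREACH contains only plain aggregates, Pig may be able to apply its combiner here, which a nested FILTER generally prevents. That is worth verifying on your own Pig version before relying on it.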
+0

Thanks hello_abhishek, nice code, upvoted. :)