Filtering by AVG using Apache Pig

I have a file "error_data.txt" as follows:

10474 3.0 2013-05-01 7 
10474 94.0 2013-05-01 3 
10538 72.0 2013-05-01 15 
11001 95.0 2013-05-01 248 
13113 78.0 2013-05-01 18 
13116 53.0 2013-05-01 4 
13116 95.0 2013-05-01 1 
13122 89.0 2013-05-01 2 
10001 56.0 2013-05-02 7 
10413 61.0 2013-05-02 6 
......... 
.........

This is what I have so far, and it works fine:

error_data = LOAD 'error_data.txt' AS (ppapi_error_code:int, api_version:chararray, day:chararray, count:long); 
filtered_data = FILTER error_data BY api_version=='61.0';              
grouped_data = GROUP filtered_data BY day;                  
grouped_count = FOREACH grouped_data GENERATE group AS day, SUM(filtered_data.count) AS error_count; 
STORE grouped_count INTO 'out_1';

What I want to do now is filter grouped_count down to the rows whose error_count is greater than the average error_count.

grouped_count_bag = GROUP grouped_count ALL; 
average = FOREACH grouped_count_bag GENERATE AVG(grouped_count.error_count);

When I DUMP it, I get the value as a single tuple: (578.9444444444445)

So I have obtained the average. I can now filter on the literal value, like this:

filtered_grouped_count = FILTER grouped_count BY (error_count>578.9444444444445);

but what I would like to write is:

filtered_grouped_count = FILTER grouped_count BY (error_count>average);

This does not seem to be allowed. Any assistance would be appreciated.

Source

2013-05-15 Roney Michael

average = FOREACH grouped_count_bag GENERATE AVG(grouped_count.error_count) AS avg; 
grouped_count_average = CROSS grouped_count, average; 
filtered_grouped_count = FILTER grouped_count_average BY (error_count>avg);

I know the CROSS looks wasteful, but as far as I know this is the only way to do it.

Source
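
Because average holds only a single tuple, the CROSS above does nothing more than append avg to every row of grouped_count. As a possible alternative, here is a minimal sketch that assumes Pig 0.8 or later, where a single-row relation can be referenced as a scalar inside an expression. It reuses the aliases from the question; the explicit (double) cast is an addition for clarity and was not part of the original thread.

```pig
-- Sketch only: assumes Pig 0.8+ (support for casting a relation to a scalar).
grouped_count_bag = GROUP grouped_count ALL;
average = FOREACH grouped_count_bag GENERATE AVG(grouped_count.error_count) AS avg;

-- average.avg is read as a scalar; the job fails at runtime
-- if the average relation contains more than one row.
filtered_grouped_count = FILTER grouped_count BY ((double)error_count > average.avg);
```

This keeps the filter in the form the question asked for, without materializing the crossed relation by hand.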

2013-05-15 17:31:14 Eli

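Thanks for the response. I can only try it out tomorrow, since I don't have my VM with me at the moment. I'll get back to you as soon as I can.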

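@RoneyMichael You don't need a VM. Just install Pig locally and try the above. If you're on a Mac: 'brew install pig'. Then just run 'pig -x local' with the dummy data copied from above and issue the commands :) – Eli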

Windows... and I'm not on my own system; no Cygwin. :/
