
I have an external table with a column named data, where each value is a JSON object, and Hive computes a wrong sum over a field extracted from that JSON.

When I run the following Hive query:

hive> select get_json_object(data, "$.ev") from data_table limit 3;  

Total MapReduce jobs = 1 
Launching Job 1 out of 1 
Number of reduce tasks is set to 0 since there's no reduce operator 
Starting Job = job_201212171824_0218, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201212171824_0218 
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201212171824_0218 
2013-01-24 10:41:37,271 Stage-1 map = 0%, reduce = 0% 
.... 
2013-01-24 10:41:55,549 Stage-1 map = 100%, reduce = 100% 
Ended Job = job_201212171824_0218 
OK 
2 
2 
2 
Time taken: 21.449 seconds 

But when I run a sum aggregation, the result is strange:

hive> select sum(get_json_object(data, "$.ev")) from data_table limit 3; 
Total MapReduce jobs = 1 
Launching Job 1 out of 1 
Number of reduce tasks determined at compile time: 1 
In order to change the average load for a reducer (in bytes): 
    set hive.exec.reducers.bytes.per.reducer=<number> 
In order to limit the maximum number of reducers: 
set hive.exec.reducers.max=<number> 
In order to set a constant number of reducers: 
set mapred.reduce.tasks=<number> 
Starting Job = job_201212171824_0217, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201212171824_0217 
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201212171824_0217 
2013-01-24 10:39:24,485 Stage-1 map = 0%, reduce = 0% 
..... 
2013-01-24 10:41:00,760 Stage-1 map = 100%, reduce = 100% 
Ended Job = job_201212171824_0217 
OK 
9.4031522E7 
Time taken: 100.416 seconds 

Can anyone explain why? What should I do to make this work correctly?

Answer


Hive seems to be treating the values in your JSON as floats rather than ints, and since your table looks fairly large, Hive is rendering the large floating-point sum in "exponent" (scientific) notation, so 9.4031522E7 presumably means 94031522.
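Hive runs on the JVM, and this rendering matches Java's `Double.toString`, which switches to scientific notation once the magnitude reaches 10^7. A minimal standalone Java sketch (not Hive code; the value 94031522 is just the sum decoded from the output above) showing the same formatting, and how an integer cast removes it:

```java
public class SciNotation {
    public static void main(String[] args) {
        // A sum at or above 10^7, as a double (Hive's sum over non-cast values).
        double sum = 94031522.0;

        // Java prints doubles >= 1e7 in scientific notation: 9.4031522E7
        System.out.println(sum);

        // Casting to an integer type prints the plain value: 94031522
        System.out.println((long) sum);
    }
}
```

This is why casting the field to int before summing, as below, gives the readable integer result.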

If you want to make sure you are doing a sum over integers, you can cast the JSON field to an int, and the sum should then return an int:

$ hive -e "select sum(get_json_object(dt, '$.ev')) from json_table" 
8.806305E7 
$ hive -e "select sum(cast(get_json_object(dt, '$.ev') as int)) from json_table" 
88063050 

Great, that's definitely it! Thanks a lot – Julias