2014-12-02 80 views
4

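Hive with a Python transform function: "cannot recognize input near 'transform'" error

I have a Hive table that tracks the status of objects as they move through the stages of a process. The table looks like this: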

hive> desc journeys; 
object_id   string          
journey_statuses array<string> 

Here is a typical record; the journey_statuses array was built with collect_list from Hive 0.13:

12345678 ["A","A","A","B","B","B","C","C","C","C","D"] 

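The statuses in a record are ordered (if order didn't matter, I would have used collect_set). For each object_id, I want to abbreviate the journey by collapsing consecutive repeats, returning the journey statuses in the order they appear.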

I wrote a quick Python script that reads from stdin:

#!/usr/bin/env python
import ast
import itertools
import sys

for line in sys.stdin:
    # Parse the list literal safely instead of calling eval() on raw input.
    statuses = ast.literal_eval(line.strip())
    # groupby() yields one group per run of equal consecutive items, so
    # keeping each group's key collapses repeats while preserving order.
    result = [status for status, _ in itertools.groupby(statuses)]
    print(result)

I plan to use this in a Hive transform call. It seems to work when run locally:

$ echo '["A","A","A","B","B","B","C","C","C","C","D"]' | python abbreviateList.py
['A', 'B', 'C', 'D'] 

However, when I add the file and try to run it inside Hive, I get an error:

hive> add file abbreviateList.py;                   
Added resource: abbreviateList.py 

hive> select 
    > object_id, 
    > transform(journey_statuses) using 'python abbreviateList.py' as journey_statuses_abbreviated 
    > from journeys; 
NoViableAltException(... wall of Java error messages ...) 
FAILED: ParseException line 3:2 cannot recognize input near 'transform' '(' 'journey_statuses' in select expression 

Can you see what I'm doing wrong?

Answer

5

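Apparently you cannot select other fields that are not inside the transform (in your example, object_id). This other SO question seems to address it indirectly: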

How can select a column and do a TRANSFORM in Hive?

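In theory you could modify your Python script to accept object_id as an additional input field and pass it straight through as another output field, if you need it included in the output.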
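A minimal sketch of that pass-through idea, assuming Hive hands the script each row as tab-separated text and that the array column arrives as a list literal like the one in the local test. How Hive actually serializes an array<string> for the script is not confirmed here, so check that first. The function name abbreviate_row and the comma-joined output format are hypothetical choices for illustration:

```python
#!/usr/bin/env python
"""Pass-through sketch (hypothetical rewrite of abbreviateList.py).

Intended call, with both columns inside the transform clause:
  select transform(object_id, journey_statuses)
  using 'python abbreviateList.py'
  as (object_id, journey_statuses_abbreviated)
  from journeys;
"""
import ast
import itertools
import sys


def abbreviate_row(line):
    """Return 'object_id<TAB>abbreviated statuses' for one input row."""
    # Assumption: the row is object_id, a tab, then a list literal
    # such as ["A","A","B"]; adjust the parsing if Hive sends another format.
    object_id, raw_statuses = line.rstrip("\n").split("\t", 1)
    statuses = ast.literal_eval(raw_statuses)
    # groupby() yields one group per run of equal consecutive items.
    abbreviated = [status for status, _ in itertools.groupby(statuses)]
    # Emit tab-separated columns so Hive can split them into output fields.
    return "%s\t%s" % (object_id, ",".join(abbreviated))


if __name__ == "__main__":
    for row in sys.stdin:
        if row.strip():
            print(abbreviate_row(row))
```

With both columns inside the transform clause there is no extra field left in the select list, which appears to be what the parser was rejecting.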