Detecting a sequence in a Hive column with the lead function

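I'm trying to detect a sequence in a column of my Hive table. I have 3 columns (id, label, index). Each id has a sequence of labels, and index gives the order of the labels, like this: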

id label index 
a x 1 
a y 2 
a x 3 
a y 4 
b x 1 
b y 2 
b y 3 
b y 4 
b x 5 
b y 6

I want to identify whether the label sequence x, y, x, y occurs.
I was thinking of using the lead function to accomplish this, like:

select id, index, label, 
lead(label, 1) over (partition by id order by index) as l1_fac, 
lead(label, 2) over (partition by id order by index) as l2_fac, 
lead(label, 3) over (partition by id order by index) as l3_fac 
from mytable

This produces:

id index label l1_fac l2_fac l3_fac 
a 1 x y x y 
a 2 y x y NULL 
a 3 x y NULL NULL 
a 4 y NULL NULL NULL 
b 1 x y y y 
b 2 y y y x 
b 3 y y x y 
b 4 y x y NULL 
b 5 x y NULL NULL

where l1_fac, l2_fac and l3_fac are the next three label values. I can then check for the pattern with

where label = l2_fac and l1_fac = l3_fac

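This works for id = a, but not for id = b, where the label sequence is x, y, y, y, x, y. I don't care that there are 3 y's in a row; I'm only interested in the fact that it went from x to y to x to y.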

I'm not sure whether this is possible. I tried combinations of group by and partition, but without success.

Source

2015-07-21 boxl

Do you care *when* the sequence 'xyxy' occurs, i.e. at what 'index' it occurs? Or do you just want to know that it occurs somewhere for a particular id? – gobrewers14

No, I don't care what the index is, just that it occurs. – boxl

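I answered this question, where the OP wanted to collect items into a list and remove any repeated items. I think that is essentially what you want to do. This will extract the actual xyxy sequence and will also handle your second example, where xyxy occurs but is polluted by two extra y's. You need to use this UDAF to collect the label column into an array (this preserves order), then apply the UDF I referenced, then use concat_ws to turn the contents of the array into a string, and finally check that string for an occurrence of the sequence you want. The function instr returns the position of the first occurrence, or zero if it never finds the string.

Query: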

add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/other/jar/duplicates.jar;

create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

-- instr(str, substr): 1-based position of 'xyxy' in label_string, 0 if absent
select id, label_string, instr(label_string, 'xyxy') str_flg
from (
    -- join the de-duplicated array into a single string
    select id, concat_ws('', no_dups) label_string
    from (
        -- collapse runs of identical consecutive labels
        select id, remove_seq_dups(label_array) no_dups
        from (
            -- gather each id's labels into an array
            select id, collect(label) label_array
            from db.table
            group by id) x
        ) y
    ) z;

Output:

id label_string str_flg 
============================ 
a xyxy   1 
b xyxy   1

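A better option might be to simply collect label with the collect UDAF, turn it into a string, and then regex out the sequence xyxy, but I'm pretty bad at regex, so maybe someone can comment intelligently on that.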
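The regex idea above could be sketched roughly as follows. This is only a sketch under assumptions: it reuses the collect function registered in the query above, assumes the collected array really does follow index order, and assumes every label is a single character (x or y) as in the sample data. The pattern x+y+x+y+ tolerates repeated labels, so no de-duplication step is needed.

```sql
-- Sketch only: relies on the temporary function `collect` created above
-- and on the array order matching the index order (an assumption).
select id,
       label_string,
       -- rlike is true if any substring matches the Java regex
       label_string rlike 'x+y+x+y+' as has_xyxy
from (
    select id, concat_ws('', collect(label)) as label_string
    from db.table
    group by id
) s;
```

With the sample data, the label strings would be xyxy for id a and xyyyxy for id b, and both contain a match for the pattern.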

Source

2015-07-22 15:03:10 gobrewers14

I was hoping for a non-UDF answer, but I'll give it a try. Thanks. – boxl
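For a UDF-free variant, one possible approach is to stay with window functions: first drop every row whose label equals the previous label for the same id (using lag), then apply the question's lead check to the remaining rows. The sketch below is not the method from the answer above; it assumes the table and column names from the question (mytable, id, label, index) and that index can be used unquoted, as in the question's own query.

```sql
-- Sketch only: names are taken from the question.
select distinct id
from (
    -- lead() here runs after the where clause, so it only sees rows
    -- that start a new run of labels
    select id, label,
           lead(label, 1) over (partition by id order by index) as l1,
           lead(label, 2) over (partition by id order by index) as l2,
           lead(label, 3) over (partition by id order by index) as l3
    from (
        -- attach the previous label so consecutive repeats can be dropped
        select id, label, index,
               lag(label, 1) over (partition by id order by index) as prev_label
        from mytable
    ) t
    where prev_label is null or prev_label <> label
) d
where label = 'x' and l1 = 'y' and l2 = 'x' and l3 = 'y';
```

On the sample data, the de-duplicated labels for id b become x, y, x, y (indexes 1, 2, 5, 6), so both a and b should be returned.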
