2017-08-04

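Hive query to separate records based on header records

I have a file like the one below. It contains data from 4 different files that the source system merged into a single file.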

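NEWFILE= is the delimiter for the data. For example, all the data after the NEWFILE=STUDENT line, up to the NEWFILE=SUBJECT line, belongs to the STUDENT file. The problem is that the records themselves carry no pattern that identifies which file they belong to. In addition, the source system cannot split the file into 4 files.

I need to load this single input file and separate the records based on their header records.

What I did was load the data into a Hive table and try the ROW_NUMBER and random functions.

I thought of using ROW_NUMBER to identify the row number of each header and then filtering the records between header rows, but the ROW_NUMBER output does not follow the actual line order of the file. Because of this, a row that belongs to STUDENT might be assigned to SUBJECT.

I can't use the random function either, because it doesn't give the actual line number.

The file content is as follows: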

NEWFILE=STUDENT 
100 XYZ 
101 ABC 
102 DEF 
NEWFILE=SUBJECT 
1 ENGLISH 
2 MATHS 
NEWFILE=TEACHERS 
110 AAAAAAAA 
111 BBBBBBB 
222 CCCCCCC 
333 DDDDDD 
NEWFILE=CLASSES 
1 CLASS-1 
2 CLASS-2 

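Please advise how I can achieve the output I want.

The actual data in the file is on separate lines, but it is not displayed correctly in the section above. I'll try pasting it again in the comments section. – Nat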

Remove the image and put a text sample in its place. Select the text and press Ctrl+K to format it as code. –

Ctrl+K applies to the whole text, not just the first line. See the edited post. –

Answer
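-- One string column per input line; 'serialization.last.column.takes.rest'
-- keeps the whole line in rec even if it contains the ',' delimiter.
create external table myfile (rec string)
row format delimited
fields terminated by ','
tblproperties ('serialization.last.column.takes.rest'='true')
;

select  rec
       ,ifn
       ,ifn_newfile_seq

        -- position of the record within its NEWFILE= group (the header is 1)
       ,row_number() over
        (
            partition by ifn, ifn_newfile_seq
            order by     boif
        ) as ifn_newfile_rec_seq

from   (select rec
              ,input__file__name           as ifn
              ,block__offset__inside__file as boif

               -- running count of header lines in file order: every record
               -- gets the sequence number of the latest NEWFILE= header
              ,count(case when rec like 'NEWFILE=%' then 1 end) over
               (
                   partition by input__file__name
                   order by     block__offset__inside__file
               ) as ifn_newfile_seq

        from   myfile
       ) l
;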

+------------------+----------------------------------------------+-----------------+---------------------+
| rec              | ifn                                          | ifn_newfile_seq | ifn_newfile_rec_seq |
+------------------+----------------------------------------------+-----------------+---------------------+
| NEWFILE=STUDENT  | file:/home/cloudera/local_db/myfile/file.txt | 1               | 1                   |
| 100 XYZ          | file:/home/cloudera/local_db/myfile/file.txt | 1               | 2                   |
| 101 ABC          | file:/home/cloudera/local_db/myfile/file.txt | 1               | 3                   |
| 102 DEF          | file:/home/cloudera/local_db/myfile/file.txt | 1               | 4                   |
| NEWFILE=SUBJECT  | file:/home/cloudera/local_db/myfile/file.txt | 2               | 1                   |
| 1 ENGLISH        | file:/home/cloudera/local_db/myfile/file.txt | 2               | 2                   |
| 2 MATHS          | file:/home/cloudera/local_db/myfile/file.txt | 2               | 3                   |
| NEWFILE=TEACHERS | file:/home/cloudera/local_db/myfile/file.txt | 3               | 1                   |
| 110 AAAAAAAA     | file:/home/cloudera/local_db/myfile/file.txt | 3               | 2                   |
| 111 BBBBBBB      | file:/home/cloudera/local_db/myfile/file.txt | 3               | 3                   |
| 222 CCCCCCC      | file:/home/cloudera/local_db/myfile/file.txt | 3               | 4                   |
| 333 DDDDDD       | file:/home/cloudera/local_db/myfile/file.txt | 3               | 5                   |
| NEWFILE=CLASSES  | file:/home/cloudera/local_db/myfile/file.txt | 4               | 1                   |
| 1 CLASS-1        | file:/home/cloudera/local_db/myfile/file.txt | 4               | 2                   |
| 2 CLASS-2        | file:/home/cloudera/local_db/myfile/file.txt | 4               | 3                   |
+------------------+----------------------------------------------+-----------------+---------------------+
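
The query above numbers the groups but does not yet name them or split them. As a possible next step, here is a minimal sketch (not part of the original answer) that copies each header's name onto every row of its group and then keeps only the data rows of one logical file. The column name newfile_name and the choice of 'STUDENT' are illustrative assumptions. Like the query above, it assumes that block__offset__inside__file reflects the line order within each input file.

```sql
-- Sketch only: newfile_name is a hypothetical column name.
select  rec
       ,boif
from   (select rec
              ,boif

               -- only the header row yields a non-null value here, so max()
               -- copies the header's name to every row of its group;
               -- 'NEWFILE=' is 8 characters long, so the name starts at position 9
              ,max(case when rec like 'NEWFILE=%' then trim(substr(rec, 9)) end) over
               (
                   partition by ifn, ifn_newfile_seq
               ) as newfile_name

        from   (select rec
                      ,input__file__name           as ifn
                      ,block__offset__inside__file as boif

                       -- same running header count as in the query above
                      ,count(case when rec like 'NEWFILE=%' then 1 end) over
                       (
                           partition by input__file__name
                           order by     block__offset__inside__file
                       ) as ifn_newfile_seq

                from   myfile
               ) l
       ) g
where   newfile_name = 'STUDENT'      -- pick one logical file
  and   rec not like 'NEWFILE=%'      -- drop the header row itself
order by boif
;
```

For the sample file this should leave only the three STUDENT data rows. Changing the literal in the where clause selects SUBJECT, TEACHERS, or CLASSES in the same way.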
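
Thanks for the solution, Dudu. It really helps. I'm having a problem with my cluster and can't run the query right now. I'll get back to you if there are any issues. – Nat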

Hi Dudu, the solution above solved my problem. – Nat

Great :-) Don't forget to accept the answer (click the 'V' mark to the left of the answer) –