2017-09-25 46 views

Hive: count the words in a string and remove the words with a low count

Suppose I have the following table:

date_part    string_word       id 
2017-08-08  India America Advance Apartments   1 
2017-08-08  Apartments Planner Headlines    1 
2017-08-08  India America Headlines Gucci    1 
2017-08-08  Images Same Thing Africa     2 
2017-08-08  Images          2 
2017-08-07  India America Advance Apartments   2 
2017-08-07  Apartments Planner Headlines    3 
2017-08-07  India America Headlines Gucci    3 
2017-08-07  Images Same Thing Africa     3 
2017-08-07  Images          4 

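Now I want to find the word count for each day and remove the words with a low count. To get the word counts, I wrote the following query: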

SELECT date_part, word, COUNT(*) as total_word_count 
FROM table_name LATERAL VIEW explode(split(string_word, ' ')) lTable as word 
where date_part > '2017-08-05' 
GROUP BY date_part, word 

This gives the following:

date_part  word  total_word_count 
2017-08-08  India   2 
2017-08-08  America   2 
2017-08-08  Advance   1 
2017-08-08  Apartments  2 
2017-08-08  Planner   1 
2017-08-08  Headlines  2 
2017-08-08  Gucci   1 
2017-08-08  Images   2 
2017-08-08  Same    1 
2017-08-08  Thing   1 
2017-08-08  Africa   1 
2017-08-07  India   2 
2017-08-07  America   2 
2017-08-07  Advance   1 
2017-08-07  Apartments  2 
2017-08-07  Planner   1 
2017-08-07  Headlines  2 
2017-08-07  Gucci   1 
2017-08-07  Images   2 
2017-08-07  Same    1 
2017-08-07  Thing   1 
2017-08-07  Africa   1 
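
As a cross-check outside Hive, the same per-date counting can be sketched in Python. This is only an illustration of what `LATERAL VIEW explode(split(...))` followed by `GROUP BY date_part, word` computes. The names `rows` and `word_counts` are made up for this sketch, and the sample rows are copied from the table above, assuming the words are separated by single spaces.

```python
from collections import Counter

# Sample rows copied from the question: (date_part, string_word, id).
rows = [
    ("2017-08-08", "India America Advance Apartments", 1),
    ("2017-08-08", "Apartments Planner Headlines", 1),
    ("2017-08-08", "India America Headlines Gucci", 1),
    ("2017-08-08", "Images Same Thing Africa", 2),
    ("2017-08-08", "Images", 2),
    ("2017-08-07", "India America Advance Apartments", 2),
    ("2017-08-07", "Apartments Planner Headlines", 3),
    ("2017-08-07", "India America Headlines Gucci", 3),
    ("2017-08-07", "Images Same Thing Africa", 3),
    ("2017-08-07", "Images", 4),
]


def word_counts(rows, min_date="2017-08-05"):
    """Count each (date_part, word) pair, one count per occurrence."""
    counts = Counter()
    for date_part, string_word, _row_id in rows:
        # ISO-formatted dates compare correctly as plain strings,
        # which is also what the WHERE clause in the query relies on.
        if date_part > min_date:
            # Splitting on a single space mirrors split(string_word, ' ').
            for word in string_word.split(" "):
                counts[(date_part, word)] += 1
    return counts


counts = word_counts(rows)
for (date_part, word), total in counts.items():
    print(date_part, word, total)
```
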

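Now I want to remove the words whose count is less than 2, i.e. any word with a count of 1 should be removed for each date. The output should be the following: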

date_part    string_word       id 
2017-08-08  India America Apartments     1 
2017-08-08  Apartments Headlines      1 
2017-08-08  India America Headlines     1 
2017-08-08  Images          2 
2017-08-08  Images          2 
2017-08-07  India America Apartments     2 
2017-08-07  Apartments Headlines      3 
2017-08-07  India America Headlines     3 
2017-08-07  Images          3 
2017-08-07  Images          4 

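Here the words with a count of 1 have been removed. This is the output I expect, and it has to be produced for every day.

Can someone help me do this?

Thanks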


Add 'HAVING total_word_count > 1' to the query... –


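@usagi Filtering is fine, but I want to remove the words from the original table. Only words with a count greater than one should remain; the rest should be removed. That is the problem I am trying to solve. – haimen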

Answer

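-- For each date, strip the words that occur only once on that date.
select  t.date_part
       ,regexp_replace(t.string_word, concat('\\s?\\b(', e.words, ')\\b'), '') as string_word
       ,t.id

from    table_name as t

        -- e: one row per date, holding that date's single-occurrence words
        -- joined into a regex alternation such as 'Advance|Planner|Gucci'.
        -- Being an inner join, it drops dates that have no such words.
        join   (select  date_part
                       ,concat_ws('|', collect_list(col)) as words

                from   (-- words whose count on a given date is exactly 1
                        select  date_part
                               ,w.col

                        from    table_name t
                                lateral view explode(split(t.string_word, '\\s+')) w as col

                        group by date_part
                                ,w.col

                        having  count(*) = 1
                        ) s

                group by date_part
                ) e

        on      e.date_part = t.date_part
;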

+-------------+---------------------------+-----+
|  date_part  |        string_word        | id  |
+-------------+---------------------------+-----+
| 2017-08-07  | India America Apartments  | 2   |
| 2017-08-07  | Apartments Headlines      | 3   |
| 2017-08-07  | India America Headlines   | 3   |
| 2017-08-07  | Images                    | 3   |
| 2017-08-07  | Images                    | 4   |
| 2017-08-08  | India America Apartments  | 1   |
| 2017-08-08  | Apartments Headlines      | 1   |
| 2017-08-08  | India America Headlines   | 1   |
| 2017-08-08  | Images                    | 2   |
| 2017-08-08  | Images                    | 2   |
+-------------+---------------------------+-----+
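
To see why the regex approach gives this result, here is a minimal Python sketch of the same idea: collect the words that occur once per date, join them into one alternation pattern, and delete every match together with the whitespace in front of it. It is not Hive code, and `rows`, `remove_rare_words` and `min_count` are names invented for the illustration. It also differs from the query in three deliberate ways, all of them assumptions about what is wanted: words are escaped with `re.escape` in case they contain regex metacharacters, the result is stripped so that removing the first word of a string leaves no leading space, and rows for dates without any rare words are kept rather than dropped.

```python
import re
from collections import Counter

# Sample rows copied from the question: (date_part, string_word, id).
rows = [
    ("2017-08-08", "India America Advance Apartments", 1),
    ("2017-08-08", "Apartments Planner Headlines", 1),
    ("2017-08-08", "India America Headlines Gucci", 1),
    ("2017-08-08", "Images Same Thing Africa", 2),
    ("2017-08-08", "Images", 2),
    ("2017-08-07", "India America Advance Apartments", 2),
    ("2017-08-07", "Apartments Planner Headlines", 3),
    ("2017-08-07", "India America Headlines Gucci", 3),
    ("2017-08-07", "Images Same Thing Africa", 3),
    ("2017-08-07", "Images", 4),
]


def remove_rare_words(rows, min_count=2):
    """Drop words that occur fewer than min_count times on their date."""
    # Count every (date_part, word) occurrence, splitting on whitespace.
    counts = Counter(
        (date_part, word)
        for date_part, string_word, _row_id in rows
        for word in string_word.split()
    )

    # Per date, gather the rare words (escaped for safe use in a regex).
    rare_by_date = {}
    for (date_part, word), total in counts.items():
        if total < min_count:
            rare_by_date.setdefault(date_part, []).append(re.escape(word))

    result = []
    for date_part, string_word, row_id in rows:
        rare = rare_by_date.get(date_part)
        if rare:
            # Same shape as the Hive pattern: optional whitespace,
            # then one of the rare words as a whole word.
            pattern = r"\s?\b(?:" + "|".join(rare) + r")\b"
            string_word = re.sub(pattern, "", string_word).strip()
        result.append((date_part, string_word, row_id))
    return result


cleaned = remove_rare_words(rows)
for row in cleaned:
    print(row)
```

On the sample data this reproduces the strings in the table above, in the original row order. The word boundaries (`\b`) matter: without them a rare word such as `Same` would also be cut out of a longer word that merely contains it.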