2017-10-12 97 views
1

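Extract strings between different characters using Hive SQL

I have a field called geo_data_display which contains country, region and DMA. The three values sit between "=" and "&" characters: country between the first "=" and the first "&", region between the second "=" and the second "&", and DMA between the third "=" and the third "&". A reproducible version of the table is shown below. Country is always character data, but region and DMA can be either numeric or character, and DMA does not exist for all countries.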

A few sample values are:

country=us&region=tx&dma=625&domain=abc.net&zipcodes=76549 
country=us&region=ca&dma=803&domain=abc.com&zipcodes=90404 
country=tw&region=hsz&domain=hinet.net&zipcodes=300 
country=jp&region=1&dma=a&domain=hinet.net&zipcodes=300 

I have some sample SQL, but the geo_dma line does not work at all, and the geo_region line only works for character values:

SELECT 

UPPER(REGEXP_REPLACE(split(geo_data_display, '\\&')[0], 'country=', '')) AS geo_country 
,UPPER(split(split(geo_data_display, '\\&')[1],'\\=')[1]) AS geo_region 
,split(split(cast(geo_data_display as int), '\\&')[2],'\\=')[2] AS geo_dma 
FROM mytable 

Answers

0

Source

regexp_extract(string subject, string pattern, int index)

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 1) returns 'the'.

select 
     regexp_extract(geo_data_display, 'country=(.*?)(&region)', 1) as geo_country, 
     regexp_extract(geo_data_display, 'region=(.*?)(&dma)', 1) as geo_region, 
     regexp_extract(geo_data_display, 'dma=(.*?)(&domain)', 1) as geo_dma 
from mytable;
+0

Perfect, thanks!

+0

Overcomplicated, and it returns wrong results when the DMA is missing.
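One possible way to address the comment above, as a sketch rather than a tested fix: anchor each pattern on its own key and capture everything up to the next "&", so no pattern depends on which key happens to follow. The column aliases are taken from the question. As far as I know, regexp_extract gives back an empty string rather than NULL when nothing matches, so treat that as an assumption to verify on your Hive version.

```sql
-- Sketch only: each key is matched independently, so a row without
-- dma still yields its country and region.
-- (^|&) makes sure the key starts the string or follows a '&';
-- group 2 is the value, i.e. everything up to the next '&'.
SELECT
    regexp_extract(geo_data_display, '(^|&)country=([^&]*)', 2) AS geo_country,
    regexp_extract(geo_data_display, '(^|&)region=([^&]*)', 2)  AS geo_region,
    regexp_extract(geo_data_display, '(^|&)dma=([^&]*)', 2)     AS geo_dma
FROM mytable;
```

For the sample row country=tw&region=hsz&domain=hinet.net&zipcodes=300, the region pattern no longer requires a following "&dma", which is what made the original version miss the region.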

2

str_to_map

select geo_map['country'] as geo_country 
     ,geo_map['region'] as geo_region 
     ,geo_map['dma']  as geo_dma 

from (select str_to_map(geo_data_display,'&','=') as geo_map 
     from mytable 
     ) t 
; 

+-------------+------------+---------+
| geo_country | geo_region | geo_dma |
+-------------+------------+---------+
| us          | tx         | 625     |
| us          | ca         | 803     |
| tw          | hsz        | NULL    |
| jp          | 1          | a       |
+-------------+------------+---------+
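For anyone wanting to try this without the table: str_to_map(text, delimiter1, delimiter2) splits the text into key-value pairs, with delimiter1 separating the pairs and delimiter2 separating each key from its value. The sketch below runs it on one of the sample values as a literal and adds the UPPER calls from the question. It assumes a Hive version that allows SELECT without a FROM clause.

```sql
-- Sketch: the literal is the sample row that has no dma key.
-- Looking up a key that is absent from the map yields NULL,
-- which matches the NULL shown for 'tw' in the result above.
SELECT UPPER(geo_map['country']) AS geo_country,
       UPPER(geo_map['region'])  AS geo_region,
       geo_map['dma']            AS geo_dma
FROM (
    SELECT str_to_map('country=tw&region=hsz&domain=hinet.net&zipcodes=300',
                      '&', '=') AS geo_map
) t;
```

Because the values are looked up by key rather than by position, the order of the pairs in the string does not matter.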
0

Please try the following:

create table ch8(details map<string,string>) 
row format delimited 
collection items terminated by '&' 
map keys terminated by '='; 

Load the data into the table.
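The load step could look roughly like this. The file path is hypothetical; it assumes a plain text file with one geo_data_display value per line, so that each line ends up whole in the details column.

```sql
-- Hypothetical path: replace with the location of your own file.
-- LOCAL reads from the local filesystem; drop it to load from HDFS.
LOAD DATA LOCAL INPATH '/tmp/geo_data.txt' INTO TABLE ch8;
```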

Then create another table using CTAS:

create table ch9 as select details["country"] as country, details["region"] as region, details["dma"] as dma, details["domain"] as domain, details["zipcodes"] as zipcode from ch8; 

Select * from ch9; 