2017-02-24 63 views
3

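How to insert data from arrays returned by XPath into a Hive table

I have a Hive query that uses XPath to return a set of arrays from XML. I want to insert the elements of these arrays into a Hive table.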

The XML content in the hivexml table is:

<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag> 

This query returns the set of arrays:

select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

The output of the above query (the set of arrays) is:

["1","2","3","4","5"] [".net","html","css","php","c"] ["244006","602809","434937","1009113","236386"] ["3624959","3673183","3644670","3624936","3624961"] ["3607476","3673182","3644669","3607050","3607013"]

I want to insert these values into a Hive table in this format:

1 .net 244006  3624959 3607476 
2 html 602809  3673183 3673182 
3 css  434937  3644670 3644669 
4 php  1009113 3624936 3607050 
5 c  236386  3624961 3607013 

If I run an insert with the above select query:

insert into newhivexml select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

Then I get an error:

NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToInteger with (array). Possible choices: FUNC(bigint) FUNC(boolean) FUNC(decimal(38,18)) FUNC(double) FUNC(float) FUNC(smallint) FUNC(string) FUNC(struct) FUNC(timestamp) FUNC(tinyint) FUNC(void)

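I assume we can't insert directly like this and that I'm missing something here. Can anyone tell me how to do this, that is, how to insert these values from the arrays into the table?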

+0

Just to make sure: the XML is only the content of a column, not the whole data, right?

Answers

2

xpath_...(str,concat('/tag/row[',pe.pos+1,']/@...'))

create table hivexml (str string); 

insert into hivexml values ('<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>'); 

select xpath_int (str,concat('/tag/row[',pe.pos+1,']/@Id'   )) as Id 
     ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName'  )) as TagName 
     ,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@Count'  )) as Count 
     ,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@ExcerptPostId')) as ExcerptPostId 
     ,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@WikiPostId' )) as WikiPostId 

from hivexml 
     lateral view posexplode (xpath(str,'/tag/row/@Id')) pe 
; 

+----+------------+---------+---------------+------------+ 
| id | tagname | count | excerptpostid | wikipostid | 
+----+------------+---------+---------------+------------+ 
| 1 | .net  | 244006 |  3624959 | 3607476 | 
| 2 | html  | 602809 |  3673183 | 3673182 | 
| 3 | javascript | 1274350 |  3624960 | 3607052 | 
| 4 | css  | 434937 |  3644670 | 3644669 | 
| 5 | php  | 1009113 |  3624936 | 3607050 | 
| 8 | c   | 236386 |  3624961 | 3607013 | 
+----+------------+---------+---------------+------------+ 
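
For readers who want to see the idea outside Hive, here is a minimal Python sketch of what the query above appears to do: list the Id attributes once to get one position per row, then use a 1-based positional XPath (row[1], row[2], ...) to read every attribute of that same row. The function name rows_by_position and the shortened XML constant are made up for illustration; this mirrors the logic only and is not how Hive evaluates the query.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question (3 of the 6 rows).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" />'
    '<row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" />'
    '</tag>'
)


def rows_by_position(xml_text):
    """Hypothetical helper: return one tuple per <row>, addressed by position."""
    root = ET.fromstring(xml_text)  # root element is <tag>
    # Counterpart of xpath(str,'/tag/row/@Id'): one entry per row.
    ids = [row.get("Id") for row in root.findall("row")]
    rows = []
    for pos, _ in enumerate(ids):  # like posexplode: pos starts at 0
        # XPath positions are 1-based, hence pos + 1.
        row = root.find(f"row[{pos + 1}]")
        rows.append((
            int(row.get("Id")),
            row.get("TagName"),
            int(row.get("Count")),
            int(row.get("ExcerptPostId")),
            int(row.get("WikiPostId")),
        ))
    return rows


for r in rows_by_position(XML):
    print(r)
```

Because each output row is read from a single row element, the values stay together even if the Id values are not consecutive (note the jump to 8 in the sample).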
+0

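Thanks, it worked! One small glitch: we can't put line breaks in the query; it shows the error "The syntax of the command is incorrect." If I put everything on one line, it works!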

0

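The problem is that the XPath function returns all matching results, with each request in its own array and nothing joining them. If it suits you, you can use Pig in batch mode, which reduces the process to a single step:

REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();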

A = LOAD '/tmp/text.xml' using org.apache.pig.piggybank.storage.XMLLoader('tag') as (x:chararray); 

B = FOREACH A GENERATE XPathAll(x, 'row/@Id',false,false).$0, 
    XPathAll(x, 'row/@TagName',false,false).$0, 
    XPathAll(x, 'row/@Count',false,false).$0, 
    XPathAll(x, 'row/@ExcerptPostId',false,false).$0, 
    XPathAll(x, 'row/@WikiPostId',false,false).$0; 

DUMP B; 

(1,.net,244006,3624959,3607476) 
(2,html,602809,3673183,3673182) 
(3,javascript,1274350,3624960,3607052) 
(4,css,434937,3644670,3644669) 
(5,php,1009113,3624936,3607050) 
(8,c,236386,3624961,3607013) 

STORE B INTO 'YourTable' USING org.apache.hive.hcatalog.pig.HCatStorer();
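
To illustrate the point made at the start of this answer, the following Python sketch shows the shape of the data: each xpath() call yields one column-wise array, and rows only appear once the arrays are paired up by position. The helper name to_rows is invented for the example, and only three of the five columns are shown.

```python
# Column-wise arrays, shaped like what xpath() returns for the first three sample rows.
ids = ["1", "2", "3"]
tag_names = [".net", "html", "javascript"]
counts = ["244006", "602809", "1274350"]


def to_rows(*columns):
    """Hypothetical helper: pair the i-th element of every column into one row tuple."""
    return list(zip(*columns))


rows = to_rows(ids, tag_names, counts)
for row in rows:
    print(row)
```

Pairing by position is only sound when every row carries every attribute. If one row lacks an attribute, that column's array is shorter and all later values shift up by one.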
1

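xpath(str,concat('/tag/row[',pe.pos+1,']/@*'))

This is a very clean way to extract all the values of an element.
What surprised me here is that the attributes do not seem to be ordered as they appear in the XML, but alphabetically by name:
@Count, @ExcerptPostId, @Id, @TagName, @WikiPostId

Unfortunately, I can't consider it a legitimate solution unless I know the alphabetical attribute order is guaranteed.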

select xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values 

from hivexml 
     lateral view posexplode (xpath(str,'/tag/row/@Id')) pe 
; 

-

["244006","3624959","1",".net","3607476"] 
["602809","3673183","2","html","3673182"] 
["1274350","3624960","3","javascript","3607052"] 
["434937","3644670","4","css","3644669"] 
["1009113","3624936","5","php","3607050"] 
["236386","3624961","8","c","3607013"] 

select row_values[2] as Id 
     ,row_values[3] as TagName 
     ,row_values[0] as Count  
     ,row_values[1] as ExcerptPostId 
     ,row_values[4] as WikiPostId 

from (select xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values 

     from hivexml 
       lateral view posexplode (xpath(str,'/tag/row/@Id')) pe 
     ) x 
; 

+----+------------+---------+---------------+------------+ 
| id | tagname | count | excerptpostid | wikipostid | 
+----+------------+---------+---------------+------------+ 
| 1 | .net  | 244006 |  3624959 | 3607476 | 
| 2 | html  | 602809 |  3673183 | 3673182 | 
| 3 | javascript | 1274350 |  3624960 | 3607052 | 
| 4 | css  | 434937 |  3644670 | 3644669 | 
| 5 | php  | 1009113 |  3624936 | 3607050 | 
| 8 | c   | 236386 |  3624961 | 3607013 | 
+----+------------+---------+---------------+------------+ 
+1

You are a true Hive master. I never imagined something like this could be done in Hive in a single query. +1 for each solution – Alex

1

split + str_to_map

select vals["Id"]    as Id 
     ,vals["TagName"]   as TagName 
     ,vals["Count"]   as Count  
     ,vals["ExcerptPostId"] as ExcerptPostId 
     ,vals["WikiPostId"]  as WikiPostId 

from (select str_to_map(e.val,' ','=') as vals 

     from hivexml 
       lateral view posexplode(split(translate(str,'"',''),'/?><row')) e 

     where e.pos <> 0 
     ) x 
; 

+----+------------+---------+---------------+------------+ 
| id | tagname | count | excerptpostid | wikipostid | 
+----+------------+---------+---------------+------------+ 
| 1 | .net  | 244006 |  3624959 | 3607476 | 
| 2 | html  | 602809 |  3673183 | 3673182 | 
| 3 | javascript | 1274350 |  3624960 | 3607052 | 
| 4 | css  | 434937 |  3644670 | 3644669 | 
| 5 | php  | 1009113 |  3624936 | 3607050 | 
| 8 | c   | 236386 |  3624961 | 3607013 | 
+----+------------+---------+---------------+------------+ 
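
The string-based approach above can be sketched in Python to make each step visible: strip the double quotes, split on the regular expression /?><row, discard the first chunk (the part before the first row), and turn the remaining space-separated key=value tokens into a map. The function name parse_rows and the shortened XML constant are made up for illustration. One deliberate difference: the sketch skips tokens that contain no '=', which Hive's str_to_map may handle differently.

```python
import re

# Shortened copy of the sample document from the question (2 of the 6 rows).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" />'
    '</tag>'
)


def parse_rows(xml_text):
    """Hypothetical helper mirroring translate + split + posexplode + str_to_map."""
    no_quotes = xml_text.replace('"', '')      # translate(str,'"','')
    chunks = re.split(r'/?><row', no_quotes)   # split(..., '/?><row')
    rows = []
    for pos, chunk in enumerate(chunks):       # posexplode gives (pos, val)
        if pos == 0:                           # where e.pos <> 0: drop the '<tag' prefix
            continue
        # str_to_map(val, ' ', '='): pairs separated by spaces, key and value by '='.
        pairs = (token.split('=', 1) for token in chunk.split(' ') if '=' in token)
        rows.append({key: value for key, value in pairs})
    return rows


for row in parse_rows(XML):
    print(row["Id"], row["TagName"], row["Count"])
```

This kind of parsing is fragile: an attribute value containing a space, a double quote, or '=' would break the tokenization, so it only suits data as regular as the sample.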
1

If the data is an XML document

The XML SerDe is available from https://github.com/01org/graphbuilder/blob/master/src/com/intel/hadoop/graphbuilder/preprocess/inputformat/XMLInputFormat.java

add jar /home/cloudera/hivexmlserde-1.0.5.3.jar; 

create external table hivexml_ext 
(
    Id    string 
    ,TagName   string 
    ,Count   string 
    ,ExcerptPostId string 
    ,WikiPostId  string 
) 
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
with serdeproperties 
(
    "column.xpath.Id"   = "/row/@Id" 
    ,"column.xpath.TagName"  = "/row/@TagName" 
    ,"column.xpath.Count"   = "/row/@Count " 
    ,"column.xpath.ExcerptPostId" = "/row/@ExcerptPostId" 
    ,"column.xpath.WikiPostId" = "/row/@WikiPostId" 
) 
stored as 
inputformat  'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
location  '/user/hive/warehouse/hivexml' 
tblproperties 
(
    "xmlinput.start" = "<row" 
    ,"xmlinput.end" = "/>" 
) 
; 

select * from hivexml_ext as x 
; 

+------+------------+---------+-----------------+--------------+ 
| x.id | x.tagname | x.count | x.excerptpostid | x.wikipostid | 
+------+------------+---------+-----------------+--------------+ 
| 1 | .net  | 244006 |   3624959 |  3607476 | 
| 2 | html  | 602809 |   3673183 |  3673182 | 
| 3 | javascript | 1274350 |   3624960 |  3607052 | 
| 4 | css  | 434937 |   3644670 |  3644669 | 
| 5 | php  | 1009113 |   3624936 |  3607050 | 
| 8 | c   | 236386 |   3624961 |  3607013 | 
+------+------------+---------+-----------------+--------------+ 
+0

I don't have Java on my computer. Will the code above run in PowerShell if I copy it as is? I'm worried about the first line, which adds the jar file.

+0

After you download the jar, the 'add jar' command should be executed in Hive. Put the jar wherever you like and change the path accordingly.

+0

Should the jar file be on my local machine or on Azure? I've put it on my local machine, but it says the file does not exist.
