2015-10-13

(First post!) Apache Pig - Calling the Java UDF ToJson multiple times in a script

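I've been playing with a simple example data set of resumes. The resume object is fairly complex, with multiple sub-objects. For the current phase of my project, I'm trying to flatten the data set by storing the sub-objects as JSON strings. I've run into a schema problem with the ToJson UDF (https://github.com/rjurney/pig-to-json).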

If I run the following statement in my Pig script, I get the correct data in my fields, but every ToJson() call reuses the field names from Positions:

stringifiedJSON = 
FOREACH fullJSON 
GENERATE 
id .. TotalYears, 
com.hortonworks.pig.udf.ToJson(Awards) AS Awards:chararray, 
com.hortonworks.pig.udf.ToJson(Certifications) AS Certifications:chararray, 
CASE WHEN Degrees IS NULL THEN '[]' ELSE com.hortonworks.pig.udf.ToJson(Degrees) END AS Degrees:chararray, 
com.hortonworks.pig.udf.ToJson(Links) AS Links:chararray, 
com.hortonworks.pig.udf.ToJson(Groups) AS Groups:chararray, 
com.hortonworks.pig.udf.ToJson(MilitaryService) AS MilitaryService:chararray, 
com.hortonworks.pig.udf.ToJson(Positions) AS Positions:chararray; 

If I DESCRIBE the "fullJSON" data set, this is what I get back ("..." stands for other fields that aren't relevant to the discussion):

fullJSON: 
{ 
id: chararray, 
.. 
Awards: {award: (AwardDate: chararray,AwardDescription: chararray,AwardTitle: chararray)}, 
Certifications: {certification: (CertDescription: chararray,CertEndDate: chararray,CertStartDate: chararray,CertTitle: chararray)}, 
… 
Degrees: {(DegreeTitle: chararray,DegreeEndDate: chararray,DegreeStartDate: chararray,School: chararray,SchoolCity: chararray,SchoolState: chararray,DegreeEducationLevel: chararray)}, 
… 
Links: {link: (LinkTitle: chararray,LinkURL: chararray)}, 
Groups: {group: (GroupDescription: chararray,GroupEndDate: chararray,GroupStartDate: chararray,GroupTitle: chararray)}, 
… 
MilitaryService: {military_service: (MilitaryBranch: chararray,MilitaryCommendations: chararray,MilitaryCountry: chararray,MilitaryDescripton: chararray,MilitaryStartDate: chararray,MilitaryEndDate: chararray,MilitaryRank: chararray)}, 
… 
Positions: {(Company: chararray,CompanyCity: chararray,CompanyState: chararray,JobStartDate: chararray,JobEndDate: chararray,JobTitle: chararray,IsCurrentTitle: int)}, 
… 
} 

Does anyone have any ideas? I tried splitting the ToJson() calls into their own separate steps, but I got the same result.

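I later played with the ToJson.java source a bit, and I think I've narrowed it down to the code below. I added log output of strSchema immediately after it, and it always returns the same information (the Positions schema).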

if (myProperties == null) { 
    // Retrieve our class specific properties from UDFContext 
    myProperties = UDFContext.getUDFContext().getUDFProperties(this.getClass()); 
} 

String strSchema = myProperties.getProperty("horton.json.udf.schema"); 

Here is a sample of the stringifiedJSON output. Note that Degrees is serialized with the Positions field names:

{ 
    "id":"http://something.com/some_guy", 
    ... 
    "Awards":"[]", 
    "Certifications":"[]", 
    "Degrees":"[{\"CompanyState\":null,\"CompanyCity\":null,\"JobEndDate\":\"\",\"IsCurrentTitle\":\"Bachelor's Degree\",\"JobTitle\":\"\",\"Company\":\"BS in Marketing\",\"JobStartDate\":\"State University\"}]", 
    "Links":"[]", 
    "Groups":"[]", 
    "MilitaryService":"[]", 
    "Positions":"[{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Scottsdale\",\"JobEndDate\":\"2010-03-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Job runner\",\"Company\":\"somecompany\",\"JobStartDate\":\"2005-06-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Scottsdale\",\"JobEndDate\":\"2010-03-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Sales Rep\",\"Company\":\"Company2\",\"JobStartDate\":\"2005-06-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":\"2004-12-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Job 3\",\"Company\":\"Company3\",\"JobStartDate\":\"1991-05-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":\"2004-12-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"CompanyRep\",\"Company\":\"Company4\",\"JobStartDate\":\"1991-05-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":null,\"IsCurrentTitle\":null,\"JobTitle\":\"Job5\",\"Company\":\"Company5\",\"JobStartDate\":\"2014-09-01T00:00:00.000Z\"}]" 
} 

Answer

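Here's what I wound up doing. It's not the way I would have preferred to do it, but it works. I'd rather not have to make 7 different DEFINE calls at the start; I'd like to just call the function itself and have it work.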

I added a String called signature and a constructor to the class:

String signature = null; 
public ToJson(String Signature) { 
    signature = Signature; 
} 

I modified the class's outputSchema() and added the signature to the getUDFProperties() call:

Properties udfProp = context.getUDFProperties(this.getClass(),new String[]{signature}); 

I also modified exec() the same way:

myProperties = UDFContext.getUDFContext().getUDFProperties(this.getClass(),new String[]{signature}); 

Then, in the Pig script itself, I added some DEFINE statements:

DEFINE awardToJson com.hortonworks.pig.udf.ToJson('award'); 
DEFINE certToJson com.hortonworks.pig.udf.ToJson('cert'); 
DEFINE degreeToJson com.hortonworks.pig.udf.ToJson('degree'); 
DEFINE linkToJson com.hortonworks.pig.udf.ToJson('link'); 
DEFINE groupToJson com.hortonworks.pig.udf.ToJson('group'); 
DEFINE militaryToJson com.hortonworks.pig.udf.ToJson('military'); 
DEFINE positionToJson com.hortonworks.pig.udf.ToJson('position'); 

Then I adjusted the function calls in the Pig script:

stringifiedJSON = 
    FOREACH fullJSON 
    GENERATE 
    id .. TotalYears, 
    awardToJson(Awards) AS Awards:chararray, 
    certToJson(Certifications) AS Certifications:chararray, 
    CASE WHEN Degrees IS NULL THEN '[]' ELSE degreeToJson(Degrees) END AS Degrees:chararray, 
    linkToJson(Links) AS Links:chararray, 
    groupToJson(Groups) AS Groups:chararray, 
    militaryToJson(MilitaryService) AS MilitaryService:chararray, 
    positionToJson(Positions) AS Positions:chararray 
    ;
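
To illustrate why the signature helps, here is a minimal, self-contained sketch. It does not use Pig at all: UdfPropertiesSketch is a hypothetical stand-in for UDFContext, the nested ToJson class is an empty placeholder, and the schema strings are made up. It also assumes that outputSchema() stores the schema under "horton.json.udf.schema" and that properties fetched by class alone are shared by every call site of that class, which is what the strSchema log output in the question suggests.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Simplified stand-in for Pig's UDFContext; NOT the real implementation. */
public class UdfPropertiesSketch {
    private final Map<String, Properties> store = new HashMap<>();

    /** Models getUDFProperties(Class): one Properties object per UDF class. */
    public Properties getUDFProperties(Class<?> c) {
        return store.computeIfAbsent(c.getName(), k -> new Properties());
    }

    /** Models getUDFProperties(Class, String[]): one Properties object per class + args. */
    public Properties getUDFProperties(Class<?> c, String[] args) {
        String key = c.getName() + ":" + String.join(",", args);
        return store.computeIfAbsent(key, k -> new Properties());
    }

    /** Hypothetical placeholder for the ToJson UDF class. */
    static class ToJson {}

    static final String SCHEMA_KEY = "horton.json.udf.schema";

    /** Two call sites write to the same class-level bucket; returns what the Awards call reads back. */
    public static String sharedLookup() {
        UdfPropertiesSketch ctx = new UdfPropertiesSketch();
        ctx.getUDFProperties(ToJson.class).setProperty(SCHEMA_KEY, "awards-schema");
        ctx.getUDFProperties(ToJson.class).setProperty(SCHEMA_KEY, "positions-schema");
        return ctx.getUDFProperties(ToJson.class).getProperty(SCHEMA_KEY);
    }

    /** Each call site writes to its own bucket, keyed by the signature passed via DEFINE. */
    public static String signedLookup() {
        UdfPropertiesSketch ctx = new UdfPropertiesSketch();
        ctx.getUDFProperties(ToJson.class, new String[]{"award"}).setProperty(SCHEMA_KEY, "awards-schema");
        ctx.getUDFProperties(ToJson.class, new String[]{"position"}).setProperty(SCHEMA_KEY, "positions-schema");
        return ctx.getUDFProperties(ToJson.class, new String[]{"award"}).getProperty(SCHEMA_KEY);
    }

    public static void main(String[] args) {
        System.out.println("class-only key: " + sharedLookup());
        System.out.println("class + signature key: " + signedLookup());
    }
}
```

With the class-only key the last schema written wins, so the Awards call reads back the Positions schema; with the signature in the key, each DEFINE'd alias keeps its own schema.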