2016-12-15 69 views
-1

How can I split one bag into multiple bags using Apache Pig?

I have a file containing two sets of records, like this:

1,abc,10,dss 
2,efgh,as 
1,abc,10,1234 
2,efgh,as 
1,abc,10,7899 
2,efgh,as 

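Records starting with 1 form one set, and records starting with 2 form a different set, so the two sets have different structures. How can I separate these two sets of records?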

Answers

0

Here is one approach:

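-- Load each line into one chararray field (the default delimiter is tab)
A = LOAD '/user/data/split.txt' AS (line:chararray);
-- TOKENIZE on spaces, then FLATTEN the resulting bag into separate rows
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line, ' '));
B1 = FILTER B BY $0 matches '1.*';
B2 = FILTER B BY $0 matches '2.*';
DUMP B1;
DUMP B2;

Or, using SPLIT:

SPLIT B INTO B1 IF ($0 matches '1.*'), B2 IF ($0 matches '2.*');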

Hi, I am getting the error below: – Ram


Hi, I am getting the following error: 2016-12-16 10:32:17,936 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: <line 3, column 46> Invalid field projection. Projected field [line] does not exist. I ran the following code: grunt> file2 = LOAD '/hls/hls_wi/training/twofile.csv' USING PigStorage(','); grunt> B = FOREACH file2 GENERATE Flatten(TOKENIZE(line,'')); – Ram


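You are using "USING PigStorage(',');" where the answer uses "as line:chararray;". Without that schema there is no field named line, which is why the projection fails. Use "as line:chararray;" together with TOKENIZE, and confirm your input format. –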

0

With the updated version of the input, here is another solution:

-- Load each line into one chararray field; $0 refers to that field
A = LOAD '/user/data/split.txt' AS (line:chararray);
B1 = FILTER A BY $0 matches '1.*';
B2 = FILTER A BY $0 matches '2.*';

Or, using SPLIT:

SPLIT A INTO B1 IF ($0 matches '1.*'), B2 IF ($0 matches '2.*');
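
Since the question says the two sets have different structures, one possible follow-up is to load the file with a comma delimiter and give each set its own schema after splitting. This is only a minimal sketch: the relation names and the field names (id, name, qty, code, tag) are hypothetical, and the types are guessed from the sample rows. It loads without a schema, so fields are referenced by position and cast explicitly.

```pig
-- Load comma-separated fields without a schema; fields are referenced by position.
raw = LOAD '/user/data/split.txt' USING PigStorage(',');

-- Route each row by the value of its first field.
SPLIT raw INTO set1 IF ((chararray)$0 == '1'), set2 IF ((chararray)$0 == '2');

-- Hypothetical schemas: four fields for the first set, three for the second.
set1_typed = FOREACH set1 GENERATE (int)$0 AS id, (chararray)$1 AS name,
                                   (int)$2 AS qty, (chararray)$3 AS code;
set2_typed = FOREACH set2 GENERATE (int)$0 AS id, (chararray)$1 AS name,
                                   (chararray)$2 AS tag;

DUMP set1_typed;
DUMP set2_typed;
```

Comparing the first field for equality, rather than matching '1.*', also avoids picking up rows whose first field merely begins with 1, such as 10.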