
This is what I did to tokenize in Pig. My Pig script for tokenization (using a Python UDF):

-- Set the debug mode
SET debug 'off'
-- Register the Python UDF
REGISTER /home/hema/phd/work1/coding/myudf.py USING streaming_python AS myudf

RAWDATA = LOAD '/home/hema/temp' USING TextLoader() AS content;
LOWERCASE_DATA = FOREACH RAWDATA GENERATE LOWER(content) AS con;
TOKENIZED_DATA = FOREACH LOWERCASE_DATA GENERATE myudf.special_tokenize(con) AS conn;
DUMP TOKENIZED_DATA; 

My Python UDF:

from pig_util import outputSchema 
import nltk 

@outputSchema('word:chararray') 
def special_tokenize(input): 
    tokens=nltk.word_tokenize(input) 
    return tokens 
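
(For reference, calling the same function outside of Pig returns a clean Python list, so the stray characters are not coming from NLTK itself. A minimal standalone check, assuming NLTK and its punkt tokenizer data are installed:)

import nltk

# Outside Pig, word_tokenize returns a plain Python list with no
# underscores or vertical bars added.
print(nltk.word_tokenize("is there any possibility to use additionalcontext?"))
# ['is', 'there', 'any', 'possibility', 'to', 'use', 'additionalcontext', '?']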

The code works fine, but the output is messy. How do I get rid of the unwanted underscores and vertical bars? The output looks like this:

(|{_|(_additionalcontext|)_|,_|(_in|)_|,_|(_namefinder|)_|}_) 
(|{_|(_is|)_|,_|(_there|)_|,_|(_any|)_|,_|(_possibility|)_|,_|(_to|)_|,_|(_use|)_|,_|(_additionalcontext|)_|,_|(_with|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_?|)_|,_|(_if|)_|,_|(_so|)_|,_|(_,|)_|,_|(_how|)_|,_|(_?|)_|,_|(_if|)_|,_|(_there|)_|,_|(_is|)_|,_|(_n't|)_|,_|(_maybe|)_|,_|(_this|)_|,_|(_should|)_|,_|(_be|)_|,_|(_an|)_|,_|(_issue|)_|,_|(_to|)_|,_|(_be|)_|,_|(_added|)_|,_|(_in|)_|,_|(_the|)_|,_|(_future|)_|,_|(_releases|)_|,_|(_?|)_|}_) 
(|{_|(_i|)_|,_|(_would|)_|,_|(_really|)_|,_|(_greatly|)_|,_|(_appreciate|)_|,_|(_if|)_|,_|(_someone|)_|,_|(_can|)_|,_|(_help|)_|,_|(_(|)_|,_|(_give|)_|,_|(_me|)_|,_|(_some|)_|,_|(_sample|)_|,_|(_code/show|)_|,_|(_me|)_|,_|(_)|)_|,_|(_how|)_|,_|(_to|)_|,_|(_add|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_features|)_|,_|(_while|)_|,_|(_training|)_|,_|(_and|)_|,_|(_testing|)_|,_|(_namefinder|)_|,_|(_.|)_|}_) 
(|{_|(_if|)_|,_|(_the|)_|,_|(_incoming|)_|,_|(_data|)_|,_|(_is|)_|,_|(_just|)_|,_|(_tokens|)_|,_|(_with|)_|,_|(_no|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_information|)_|,_|(_,|)_|,_|(_where|)_|,_|(_is|)_|,_|(_the|)_|,_|(_information|)_|,_|(_taken|)_|,_|(_then|)_|,_|(_?|)_|,_|(_a|)_|,_|(_new|)_|,_|(_file|)_|,_|(_?|)_|,_|(_run|)_|,_|(_a|)_|,_|(_pos|)_|,_|(_tagging|)_|,_|(_model|)_|,_|(_before|)_|,_|(_training|)_|,_|(_?|)_|,_|(_or|)_|,_|(_?|)_|}_) 
(|{_|(_and|)_|,_|(_what|)_|,_|(_is|)_|,_|(_the|)_|,_|(_purpose|)_|,_|(_of|)_|,_|(_the|)_|,_|(_resources|)_|,_|(_(|)_|,_|(_i.e|)_|,_|(_.|)_|,_|(_collection.|)_|,_|(_<|)_|,_|(_string|)_|,_|(_,|)_|,_|(_object|)_|,_|(_>|)_|,_|(_emptymap|)_|,_|(_(|)_|,_|(_)|)_|,_|(_)|)_|,_|(_in|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_method|)_|,_|(_?|)_|,_|(_what|)_|,_|(_should|)_|,_|(_be|)_|,_|(_ideally|)_|,_|(_included|)_|,_|(_in|)_|,_|(_there|)_|,_|(_?|)_|}_) 
(|{_|(_i|)_|,_|(_just|)_|,_|(_ca|)_|,_|(_n't|)_|,_|(_get|)_|,_|(_these|)_|,_|(_things|)_|,_|(_from|)_|,_|(_the|)_|,_|(_java|)_|,_|(_doc|)_|,_|(_api|)_|,_|(_.|)_|}_) 
(|{_|(_in|)_|,_|(_advance|)_|,_|(_!|)_|}_) 
(|{_|(_best|)_|,_|(_,|)_|}_) 
(|{_|(_svetoslav|)_|}_) 

Raw data:

AdditionalContext in NameFinder 
Is there any possibility to use additionalContext with the NameFinderME.train? If so, how? If there isn't maybe this should be an issue to be added in the future releases? 
I would REALLY greatly appreciate if someone can help (give me some sample code/show me) how to add POS tag features while training and testing NameFinder. 
If the incoming data is just tokens with NO POS tag information, where is the information taken then? A new file? Run a POS tagging model before training? Or? 
And what is the purpose of the resources (i.e. Collection.<String,Object>emptyMap()) in the NameFinderME.train method? What should be ideally included in there? 
I just can't get these things from the Java doc API. 
in advance! 
Best, 
Svetoslav 

I would like a list of tokens as my final output. Thanks in advance.


@cricket_007 I have posted my raw data as an edit. I don't think NLTK is producing the underscores and vertical bars; the same word_tokenize() method works fine when I run it in the grunt shell. –


Okay, follow-up question: what is your expected output? (And a side note: you are already passing the string to Python, so why do the extra map-reduce work of lowercasing it in Pig?) –
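
(A minimal sketch of this comment's side note, folding the lowercasing into the UDF so the extra FOREACH can be dropped; a hypothetical variant, not the poster's actual code:)

from pig_util import outputSchema
import nltk

@outputSchema('word:chararray')
def special_tokenize(content):
    # Lowercase inside the UDF so the LOWERCASE_DATA step in the
    # Pig script becomes unnecessary.
    return nltk.word_tokenize(content.lower())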


I am expecting a tuple of token strings as output, e.g. ('additionalcontext', 'in', 'namefinder'). I actually want to do all of my preprocessing in Pig. Pig's built-in TOKENIZE doesn't tokenize the text the way I want, which is why I want to use NLTK. –
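
(One plausible source of the stray characters is that the declared schema, 'word:chararray', does not match the Python list the UDF actually returns, so Pig ends up serializing the list itself. A sketch that declares a bag schema instead; the syntax below follows Pig's Jython UDF convention and may need adjusting for streaming_python:)

from pig_util import outputSchema
import nltk

# Hypothetical variant: declare a bag of single-field tuples so the
# returned list maps to a Pig bag of tokens rather than a chararray.
@outputSchema('tokens:{t:(word:chararray)}')
def special_tokenize(content):
    return nltk.word_tokenize(content)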

Answers

from pig_util import outputSchema
import nltk
import re

@outputSchema('word:chararray')
def special_tokenize(input):
    # Split camelCase words ("NameFinder" -> "Name Finder") before tokenizing.
    temp_data = re.sub(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', " ", input)
    # Encode to UTF-8 (Python 2) so NLTK receives a consistently encoded string.
    tokens = nltk.word_tokenize(temp_data.encode('utf-8'))
    # Join into one comma-separated string to match the chararray schema.
    return ','.join(tokens)

There was a problem with the encoding of the input. Changing it to UTF-8 solved the problem.
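
(For a quick check outside Pig, the same logic can be run standalone; this assumes Python 2, where encode('utf-8') produces a byte string that NLTK accepts:)

import re
import nltk

def special_tokenize(text):
    # Same transformation as the UDF above, runnable without Pig.
    temp_data = re.sub(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', " ", text)
    tokens = nltk.word_tokenize(temp_data.encode('utf-8'))
    return ','.join(tokens)

print(special_tokenize(u"AdditionalContext in NameFinder"))
# Additional,Context,in,Name,Finder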


Use REPLACE to strip the '_' and '|' characters, then tokenize with TOKENIZE.

-- REPLACE takes a regex, so '|' must be escaped; TOKENIZED_DATA is the
-- relation produced by the original script.
NEW_TOKENIZED_DATA = FOREACH TOKENIZED_DATA GENERATE REPLACE(REPLACE($0,'_',''),'\\|','');
TOKENS = FOREACH NEW_TOKENIZED_DATA GENERATE TOKENIZE($0);
DUMP TOKENS;

Are you asking me to tokenize twice? Why do these underscores and vertical bars end up in the bag returned by the UDF in the first place? I'd rather not have to replace them and run a second round of tokenization. –


I'm not sure where they come from. You could simply tokenize the input to get the tokens instead of using a UDF. The script I posted works on your output, not on the raw data. –


The built-in tokenizer doesn't tokenize the text the way I want it to; that's why I'm using a UDF with NLTK. –