2017-09-24

How do I tag a sentence for spaCy's sense2vec implementation?

spaCy has implemented a sense2vec word-embedding package, and they provide documentation for it.

The vectors are all keyed in the form WORD|POS. For example, the sentence

Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of trouble 

needs to be converted into

Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT I|PRON think|VERB effects|NOUN computers|NOUN have|VERB on|ADP people|NOUN are|VERB great|ADJ learning|NOUN skills/affects|NOUN because|ADP they|PRON give|VERB us|PRON time|NOUN to|PART chat|VERB with|ADP friends/new|ADJ people|NOUN ,|PUNCT helps|VERB us|PRON learn|VERB about|ADP the|DET globe(astronomy|NOUN)|PUNCT and|CONJ keeps|VERB us|PRON out|ADP of|ADP trouble|NOUN !|PUNCT 

so that it can be looked up in the pretrained sense2vec embeddings and interpreted in the sense2vec format.
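To make the target format concrete, here is a minimal sketch of how (word, tag) pairs map to and from a WORD|POS line. It does not use spaCy at all; the helper names `to_sense2vec` and `from_sense2vec` are hypothetical, and the tags are assumed to have been produced already by some tagger.

```python
def to_sense2vec(pairs):
    """Join (word, tag) pairs into a space-separated WORD|TAG string."""
    # Whitespace inside a word would break the space-separated format,
    # so it is replaced with underscores.
    return ' '.join(
        '{}|{}'.format('_'.join(word.split()), tag) for word, tag in pairs
    )


def from_sense2vec(line):
    """Split a WORD|TAG string back into (word, tag) pairs."""
    # Split on the last '|' so a word that itself contains '|' survives.
    return [tuple(item.rsplit('|', 1)) for item in line.split()]


pairs = [('Dear', 'ADJ'), ('local', 'ADJ'), ('newspaper', 'NOUN'), (',', 'PUNCT')]
line = to_sense2vec(pairs)
print(line)
```

Splitting on the last `|` is a design choice for this sketch only; it keeps the tag recoverable even when the word part contains the separator.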

How can this be done?

Answer

The following is based on spaCy's bin/merge.py implementation and does exactly what is needed:

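# Written against the spaCy 1.x API (spacy.en, Span.merge).
import re

from spacy.en import English

# Map spaCy entity labels to the coarser tags used in sense2vec keys.
LABELS = {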
    'ENT': 'ENT', 
    'PERSON': 'ENT', 
    'NORP': 'ENT', 
    'FAC': 'ENT', 
    'ORG': 'ENT', 
    'GPE': 'ENT', 
    'LOC': 'ENT', 
    'LAW': 'ENT', 
    'PRODUCT': 'ENT', 
    'EVENT': 'ENT', 
    'WORK_OF_ART': 'ENT', 
    'LANGUAGE': 'ENT', 
    'DATE': 'DATE', 
    'TIME': 'TIME', 
    'PERCENT': 'PERCENT', 
    'MONEY': 'MONEY', 
    'QUANTITY': 'QUANTITY', 
    'ORDINAL': 'ORDINAL', 
    'CARDINAL': 'CARDINAL' 
} 



nlp = None  # spaCy pipeline, loaded lazily on first use


def tag_words_in_sense2vec_format(passage):
    """Return passage tagged as WORD|POS tokens, one sentence per line."""
    global nlp
    if nlp is None:
        nlp = English()
    # spaCy needs unicode text, so decode raw bytes (a plain str on Python 2).
    if isinstance(passage, bytes):
        passage = passage.decode('utf-8', errors='ignore')
    doc = nlp(passage)
    return transform_doc(doc)


def transform_doc(doc):
    # Merge each named entity into a single token carrying its coarse label.
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    # Noun-chunk merging, left disabled:
    # for np in doc.noun_chunks:
    #     while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
    #         np = np[1:]
    #     np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        if sent.text.strip():
            strings.append(' '.join(represent_word(w) for w in sent if not w.is_space))
    if strings:
        return '\n'.join(strings) + '\n'
    return ''


def represent_word(word):
    # URLs collapse to a single placeholder key.
    if word.like_url:
        return '%%URL|X'
    # Whitespace inside a merged token would break the space-separated output.
    text = re.sub(r'\s', '_', word.text)
    # Prefer the coarse entity label; fall back to the part-of-speech tag.
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag

print(tag_words_in_sense2vec_format("Dear local newspaper, ...")) 

Result

Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT ...
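
The formatting rules in represent_word can be checked without loading a spaCy model. The sketch below repeats that function and runs it on a hypothetical stand-in, `FakeToken`, which models only the four attributes the function reads; the shortened LABELS table and the sample tokens are assumptions made for illustration.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the attributes that
# represent_word reads are modelled here.
FakeToken = namedtuple('FakeToken', 'text pos_ ent_type_ like_url')

# Shortened version of the label table, for illustration only.
LABELS = {'PERSON': 'ENT', 'GPE': 'ENT', 'DATE': 'DATE'}


def represent_word(word):
    if word.like_url:
        return '%%URL|X'
    text = re.sub(r'\s', '_', word.text)
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag


tokens = [
    FakeToken('newspaper', 'NOUN', '', False),            # plain word: POS tag
    FakeToken('New York', 'PROPN', 'GPE', False),         # merged entity: coarse label
    FakeToken('http://example.com', 'X', '', True),       # URL: placeholder key
    FakeToken('foo', '', '', False),                      # no tag available: '?'
]
results = [represent_word(t) for t in tokens]
print(results)
```

This covers the four branches: an ordinary word, a merged multi-word entity, a URL, and a token with no tag.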