2012-03-21

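Extracting information from the context-free phrase structure output by the Stanford Parser

The Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml) gives the context-free phrase structure tree shown below. What is the best way to extract all noun phrases (NP) and verb phrases (VP) from the tree? Is there a Python (or Java) library that can read structures like this? Thanks.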

(ROOT 
    (S 
    (S 
     (NP 
     (NP (DT The) (JJS strongest) (NN rain)) 
     (VP 
      (ADVP (RB ever)) 
      (VBN recorded) 
      (PP (IN in) 
      (NP (NNP India))))) 
     (VP 
     (VP (VBD shut) 
      (PRT (RP down)) 
      (NP 
      (NP (DT the) (JJ financial) (NN hub)) 
      (PP (IN of) 
       (NP (NNP Mumbai))))) 
     (, ,) 
     (VP (VBD snapped) 
      (NP (NN communication) (NNS lines))) 
     (, ,) 
     (VP (VBD closed) 
      (NP (NNS airports))) 
     (CC and) 
     (VP (VBD forced) 
      (NP 
      (NP (NNS thousands)) 
      (PP (IN of) 
       (NP (NNS people)))) 
      (S 
      (VP (TO to) 
       (VP 
       (VP (VB sleep) 
        (PP (IN in) 
        (NP (PRP$ their) (NNS offices)))) 
       (CC or) 
       (VP (VB walk) 
        (NP (NN home)) 
        (PP (IN during) 
        (NP (DT the) (NN night)))))))))) 
    (, ,) 
    (NP (NNS officials)) 
    (VP (VBD said) 
     (NP-TMP (NN today))) 
    (. .))) 

Answer


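Take a look at the Natural Language Toolkit (NLTK) at nltk.org.

The toolkit is written in Python and provides code for reading exactly these kinds of trees (along with many other things).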
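
As a rough illustration of the NLTK route, the sketch below reads a bracketed tree with `Tree.fromstring` and collects the text of each phrase with a given label. It assumes NLTK 3 is installed; the shortened tree string `s` and the helper name `phrases` are made up for this example and are not part of NLTK.

```python
from nltk.tree import Tree

# A shortened tree in the same bracketed format as the parser output
# in the question (assumption: the full output can be read the same way).
s = "(ROOT (S (NP (NNS officials)) (VP (VBD said) (NP-TMP (NN today))) (. .)))"

tree = Tree.fromstring(s)


def phrases(tree, label):
    """Return the words of every subtree whose label equals `label`."""
    matches = tree.subtrees(lambda t: t.label() == label)
    return [" ".join(t.leaves()) for t in matches]


print(phrases(tree, "NP"))
print(phrases(tree, "VP"))
```

Note that the comparison is exact, so a node labelled `NP-TMP` is not returned when asking for `NP`; loosen the filter if function-tagged labels should count as well.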

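Alternatively, you can write your own recursive function to do this. It is quite straightforward.


Just for fun, here is a very simple implementation of what you want: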

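def parse(s):
    # Pad every parenthesis with spaces so that each one becomes its own token.
    itr = iter(s.replace('(', ' ( ').replace(')', ' ) ').split())

    def _parse():
        # Read tokens until the matching ')' and return them as a nested list.
        stuff = []
        for x in itr:
            if x == ')':
                return stuff
            elif x == '(':
                stuff.append(_parse())
            else:
                stuff.append(x)
        return stuff

    return _parse()[0]

def find(parsed, tag):
    # Leaves are plain strings; only lists are subtrees.
    if not isinstance(parsed, list):
        return
    if parsed[0] == tag:
        yield parsed
    for x in parsed[1:]:
        for y in find(x, tag):
            yield y

# s is assumed to hold the bracketed tree from the question as a string.
p = parse(s)
np = find(p, 'NP')
for x in np:
    print(x)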

Output:

['NP', ['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']], ['VP', ['ADVP', ['RB', 'ever']], ['VBN', 'recorded'], ['PP', ['IN', 'in'], ['NP', ['NNP', 'India']]]]] 
['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']] 
['NP', ['NNP', 'India']] 
['NP', ['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']], ['PP', ['IN', 'of'], ['NP', ['NNP', 'Mumbai']]]]
['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']] 
['NP', ['NNP', 'Mumbai']] 
['NP', ['NN', 'communication'], ['NNS', 'lines']] 
['NP', ['NNS', 'airports']] 
['NP', ['NP', ['NNS', 'thousands']], ['PP', ['IN', 'of'], ['NP', ['NNS', 'people']]]] 
['NP', ['NNS', 'thousands']] 
['NP', ['NNS', 'people']] 
['NP', ['PRP$', 'their'], ['NNS', 'offices']] 
['NP', ['NN', 'home']] 
['NP', ['DT', 'the'], ['NN', 'night']] 
['NP', ['NNS', 'officials']]
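
The question also asks for verb phrases, and nested lists are not always the most convenient result. The following is a minimal sketch, using only the standard library, that takes the same nested-list idea one step further: it matches several labels at once and flattens each phrase to plain text. The helper names `words` and `phrases` and the shortened tree string are assumptions made for this example.

```python
def parse(s):
    """Turn a bracketed tree string into nested lists, e.g. ['NP', ['NN', 'home']]."""
    tokens = iter(s.replace("(", " ( ").replace(")", " ) ").split())

    def _parse():
        node = []
        for tok in tokens:
            if tok == ")":
                return node
            node.append(_parse() if tok == "(" else tok)
        return node

    return _parse()[0]


def words(node):
    """Collect the leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node[1:]:  # node[0] is the label
        out.extend(words(child))
    return out


def phrases(node, tags):
    """Yield (label, text) for every subtree whose label is in `tags`."""
    if isinstance(node, str):
        return
    if node[0] in tags:
        yield node[0], " ".join(words(node))
    for child in node[1:]:
        yield from phrases(child, tags)


# Shortened example tree in the same format as the parser output.
s = "(ROOT (S (NP (NNS officials)) (VP (VBD said) (NP-TMP (NN today))) (. .)))"
result = list(phrases(parse(s), {"NP", "VP"}))
print(result)
```

Passing a set of labels keeps the traversal to a single pass over the tree; nested phrases are reported separately, outer phrase first.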