2015-04-17 63 views
1

我正在嘗試查找python中標籤內的所有字符。以下是我的代碼:使用re在多個標籤中查找文本

import re 

text=''' <parse>(ROOT 
     (S 
     (NP (NNP Stanford) (NNP University)) 
     (VP (VBZ is) 
     (ADJP (JJ located) 
     (PP (IN in) 
     (NP (NNP California))))) 
     (. .))) 

     </parse> 
    <dependencies type="basic-dependencies"> 
     <dep type="root"> 
     <governor idx="0">ROOT</governor> 
     <dependent idx="4">located</dependent> 
     </dep> 
     <dep type="nn"> 
     <governor idx="2">University</governor> 
     <dependent idx="1">Stanford</dependent> 
     </dep> 
     <dep type="nsubj"> 
     <governor idx="4">located</governor> 
     <dependent idx="2">University</dependent> 
     </dep> 
     <dep type="cop"> 
     <governor idx="4">located</governor> 
     <dependent idx="3">is</dependent> 
     </dep> 
     <dep type="prep"> 
     <governor idx="4">located</governor> 
     <dependent idx="5">in</dependent> 
     </dep> 
     <dep type="pobj"> 
     <governor idx="5">in</governor> 
     <dependent idx="6">California</dependent> 
     </dep> 
    </dependencies> 
    <dependencies type="collapsed-dependencies"> 
     <dep type="root"> 
     <governor idx="0">ROOT</governor> 
     <dependent idx="4">located</dependent> 
     </dep> 
     <dep type="nn"> 
     <governor idx="2">University</governor> 
     <dependent idx="1">Stanford</dependent> 
     </dep> 
     <dep type="nsubj"> 
     <governor idx="4">located</governor> 
     <dependent idx="2">University</dependent> 
     </dep> 
     <dep type="cop"> 
     <governor idx="4">located</governor> 
     <dependent idx="3">is</dependent> 
     </dep> 
     <dep type="prep_in"> 
     <governor idx="4">located</governor> 
     <dependent idx="6">California</dependent> 
     </dep> 
    </dependencies> 
    <dependencies type="collapsed-ccprocessed-dependencies"> 
     <dep type="root"> 
     <governor idx="0">ROOT</governor> 
     <dependent idx="4">located</dependent> 
     </dep> 
     <dep type="nn"> 
     <governor idx="2">University</governor> 
     <dependent idx="1">Stanford</dependent> 
     </dep> 
     <dep type="nsubj"> 
     <governor idx="4">located</governor> 
     <dependent idx="2">University</dependent> 
     </dep> 
     <dep type="cop"> 
     <governor idx="4">located</governor> 
     <dependent idx="3">is</dependent> 
     </dep> 
     <dep type="prep_in"> 
     <governor idx="4">located</governor> 
     <dependent idx="6">California</dependent> 
     </dep> 
    </dependencies> 
    </sentence> 
    <sentence id="2"> 
    <tokens> 
     <token id="1"> 
     <word>It</word> 
     <lemma>it</lemma> 
     <CharacterOffsetBegin>46</CharacterOffsetBegin> 
     <CharacterOffsetEnd>48</CharacterOffsetEnd> 
     <POS>PRP</POS> 
     <NER>O</NER> 
     </token> 
     <token id="2"> 
     <word>is</word> 
     <lemma>be</lemma> 
     <CharacterOffsetBegin>49</CharacterOffsetBegin> 
     <CharacterOffsetEnd>51</CharacterOffsetEnd> 
     <POS>VBZ</POS> 
     <NER>O</NER> 
     </token> 
     <token id="3"> 
     <word>a</word> 
     <lemma>a</lemma> 
     <CharacterOffsetBegin>52</CharacterOffsetBegin> 
     <CharacterOffsetEnd>53</CharacterOffsetEnd> 
     <POS>DT</POS> 
     <NER>O</NER> 
     </token> 
     <token id="4"> 
     <word>great</word> 
     <lemma>great</lemma> 
     <CharacterOffsetBegin>54</CharacterOffsetBegin> 
     <CharacterOffsetEnd>59</CharacterOffsetEnd> 
     <POS>JJ</POS> 
     <NER>O</NER> 
     </token> 
     <token id="5"> 
     <word>university</word> 
     <lemma>university</lemma> 
     <CharacterOffsetBegin>60</CharacterOffsetBegin> 
     <CharacterOffsetEnd>70</CharacterOffsetEnd> 
     <POS>NN</POS> 
     <NER>O</NER> 
     </token> 
     <token id="6"> 
     <word>,</word> 
     <lemma>,</lemma> 
     <CharacterOffsetBegin>70</CharacterOffsetBegin> 
     <CharacterOffsetEnd>71</CharacterOffsetEnd> 
     <POS>,</POS> 
     <NER>O</NER> 
     </token> 
     <token id="7"> 
     <word>founded</word> 
     <lemma>found</lemma> 
     <CharacterOffsetBegin>72</CharacterOffsetBegin> 
     <CharacterOffsetEnd>79</CharacterOffsetEnd> 
     <POS>VBN</POS> 
     <NER>O</NER> 
     </token> 
     <token id="8"> 
     <word>in</word> 
     <lemma>in</lemma> 
     <CharacterOffsetBegin>80</CharacterOffsetBegin> 
     <CharacterOffsetEnd>82</CharacterOffsetEnd> 
     <POS>IN</POS> 
     <NER>O</NER> 
     </token> 
     <token id="9"> 
     <word>1891</word> 
     <lemma>1891</lemma> 
     <CharacterOffsetBegin>83</CharacterOffsetBegin> 
     <CharacterOffsetEnd>87</CharacterOffsetEnd> 
     <POS>CD</POS> 
     <NER>DATE</NER> 
     <NormalizedNER>1891</NormalizedNER> 
     <Timex tid="t1" type="DATE">1891</Timex> 
     </token> 
     <token id="10"> 
     <word>.</word> 
     <lemma>.</lemma> 
     <CharacterOffsetBegin>87</CharacterOffsetBegin> 
     <CharacterOffsetEnd>88</CharacterOffsetEnd> 
     <POS>.</POS> 
     <NER>O</NER> 
     </token> 
    </tokens> 
    <parse>(ROOT 
     (S 
     (NP (PRP It)) 
     (VP (VBZ is) 
     (NP 
     (NP (DT a) (JJ great) (NN university)) 
     (, ,) 
     (VP (VBN founded) 
     (PP (IN in) 
     (NP (CD 1891)))))) 
     (. .))) 

     </parse> 
    <dependencies type="basic-dependencies"> 
     <dep type="root"> 
     <governor idx="0">ROOT</governor> 
     <dependent idx="5">university</dependent> 
     </dep> 
     <dep type="nsubj"> 
     <governor idx="5">university</governor> 
     <dependent idx="1">It</dependent> 
     </dep> 
     <dep type="cop"> 
     <governor idx="5">university</governor> 
     <dependent idx="2">is</dependent> 
     </dep> 
     <dep type="det"> 
     <governor idx="5">university</governor> 
     <dependent idx="3">a</dependent> 
     </dep> 
     <dep type="amod"> 
     <governor idx="5">university</governor> 
     <dependent idx="4">great</dependent> 
     </dep> 
     <dep type="vmod"> 
     <governor idx="5">university</governor> 
     <dependent idx="7">founded</dependent> 
     </dep> 
     <dep type="prep"> 
     <governor idx="7">founded</governor> 
     <dependent idx="8">in</dependent> 
     </dep> 
     <dep type="pobj"> 
     <governor idx="8">in</governor> 
     <dependent idx="9">1891</dependent> 
     </dep> 
    </dependencies> 
    <dependencies type="collapsed-dependencies"> 
     <dep type="root"> 
     <governor idx="0">ROOT</governo''' 


p1=re.compile("<parse>(.*)</parse>",re.DOTALL) 
parse=p1.findall(text) 
print parse 

輸出上面的代碼是:

['(ROOT\n   (S\n   (NP (NNP Stanford) (NNP University))\n   (VP (VBZ is)\n   (ADJP (JJ located)\n   (PP (IN in)\n   (NP (NNP California)))))\n   (. .)))\n\n   </parse>\n  <dependencies type="basic-dependencies">\n   <dep type="root">\n   <governor idx="0">ROOT</governor>\n   <dependent idx="4">located</dependent>\n   </dep>\n   <dep type="nn">\n   <governor idx="2">University</governor>\n   <dependent idx="1">Stanford</dependent>\n   </dep>\n   <dep type="nsubj">\n   <governor idx="4">located</governor>\n   <dependent idx="2">University</dependent>\n   </dep>\n   <dep type="cop">\n   <governor idx="4">located</governor>\n   <dependent idx="3">is</dependent>\n   </dep>\n   <dep type="prep">\n   <governor idx="4">located</governor>\n   <dependent idx="5">in</dependent>\n   </dep>\n   <dep type="pobj">\n   <governor idx="5">in</governor>\n   <dependent idx="6">California</dependent>\n   </dep>\n  </dependencies>\n  <dependencies type="collapsed-dependencies">\n   <dep type="root">\n   <governor idx="0">ROOT</governor>\n   <dependent idx="4">located</dependent>\n   </dep>\n   <dep type="nn">\n   <governor idx="2">University</governor>\n   <dependent idx="1">Stanford</dependent>\n   </dep>\n   <dep type="nsubj">\n   <governor idx="4">located</governor>\n   <dependent idx="2">University</dependent>\n   </dep>\n   <dep type="cop">\n   <governor idx="4">located</governor>\n   <dependent idx="3">is</dependent>\n   </dep>\n   <dep type="prep_in">\n   <governor idx="4">located</governor>\n   <dependent idx="6">California</dependent>\n   </dep>\n  </dependencies>\n  <dependencies type="collapsed-ccprocessed-dependencies">\n   <dep type="root">\n   <governor idx="0">ROOT</governor>\n   <dependent idx="4">located</dependent>\n   </dep>\n   <dep type="nn">\n   <governor idx="2">University</governor>\n   <dependent idx="1">Stanford</dependent>\n   </dep>\n   <dep type="nsubj">\n   <governor idx="4">located</governor>\n   <dependent idx="2">University</dependent>\n   </dep>\n   <dep type="cop">\n   <governor idx="4">located</governor>\n   <dependent idx="3">is</dependent>\n   </dep>\n   <dep type="prep_in">\n   <governor idx="4">located</governor>\n   <dependent idx="6">California</dependent>\n   </dep>\n  </dependencies>\n  </sentence>\n  <sentence id="2">\n  <tokens>\n   <token id="1">\n   <word>It</word>\n   <lemma>it</lemma>\n   <CharacterOffsetBegin>46</CharacterOffsetBegin>\n   <CharacterOffsetEnd>48</CharacterOffsetEnd>\n   <POS>PRP</POS>\n   <NER>O</NER>\n   </token>\n   <token id="2">\n   <word>is</word>\n   <lemma>be</lemma>\n   <CharacterOffsetBegin>49</CharacterOffsetBegin>\n   <CharacterOffsetEnd>51</CharacterOffsetEnd>\n   <POS>VBZ</POS>\n   <NER>O</NER>\n   </token>\n   <token id="3">\n   <word>a</word>\n   <lemma>a</lemma>\n   <CharacterOffsetBegin>52</CharacterOffsetBegin>\n   <CharacterOffsetEnd>53</CharacterOffsetEnd>\n   <POS>DT</POS>\n   <NER>O</NER>\n   </token>\n   <token id="4">\n   <word>great</word>\n   <lemma>great</lemma>\n   <CharacterOffsetBegin>54</CharacterOffsetBegin>\n   <CharacterOffsetEnd>59</CharacterOffsetEnd>\n   <POS>JJ</POS>\n   <NER>O</NER>\n   </token>\n   <token id="5">\n   <word>university</word>\n   <lemma>university</lemma>\n   <CharacterOffsetBegin>60</CharacterOffsetBegin>\n   <CharacterOffsetEnd>70</CharacterOffsetEnd>\n   <POS>NN</POS>\n   <NER>O</NER>\n   </token>\n   <token id="6">\n   <word>,</word>\n   <lemma>,</lemma>\n   <CharacterOffsetBegin>70</CharacterOffsetBegin>\n   <CharacterOffsetEnd>71</CharacterOffsetEnd>\n   <POS>,</POS>\n   <NER>O</NER>\n   </token>\n   <token id="7">\n   <word>founded</word>\n   <lemma>found</lemma>\n   <CharacterOffsetBegin>72</CharacterOffsetBegin>\n   <CharacterOffsetEnd>79</CharacterOffsetEnd>\n   <POS>VBN</POS>\n   <NER>O</NER>\n   </token>\n   <token id="8">\n   <word>in</word>\n   <lemma>in</lemma>\n   <CharacterOffsetBegin>80</CharacterOffsetBegin>\n   <CharacterOffsetEnd>82</CharacterOffsetEnd>\n   <POS>IN</POS>\n   <NER>O</NER>\n   </token>\n   <token id="9">\n   <word>1891</word>\n   <lemma>1891</lemma>\n   <CharacterOffsetBegin>83</CharacterOffsetBegin>\n   <CharacterOffsetEnd>87</CharacterOffsetEnd>\n   <POS>CD</POS>\n   <NER>DATE</NER>\n   <NormalizedNER>1891</NormalizedNER>\n   <Timex tid="t1" type="DATE">1891</Timex>\n   </token>\n   <token id="10">\n   <word>.</word>\n   <lemma>.</lemma>\n   <CharacterOffsetBegin>87</CharacterOffsetBegin>\n   <CharacterOffsetEnd>88</CharacterOffsetEnd>\n   <POS>.</POS>\n   <NER>O</NER>\n   </token>\n  </tokens>\n  <parse>(ROOT\n   (S\n   (NP (PRP It))\n   (VP (VBZ is)\n   (NP\n   (NP (DT a) (JJ great) (NN university))\n   (, ,)\n   (VP (VBN founded)\n   (PP (IN in)\n   (NP (CD 1891))))))\n   (. .)))\n\n   '] 

但我只需要解析標籤中的人物,沒有別的。請解決這個問題。以下應該是輸出:

'(ROOT\n   (S\n   (NP (NNP Stanford) (NNP University))\n   (VP (VBZ is)\n   (ADJP (JJ located)\n   (PP (IN in)\n   (NP (NNP California)))))\n   (. .)))\n\n   
(ROOT\n   (S\n   (NP (PRP It))\n   (VP (VBZ is)\n   (NP\n   (NP (DT a) (JJ great) (NN university))\n   (, ,)\n   (VP (VBN founded)\n   (PP (IN in)\n   (NP (CD 1891))))))\n   (. .)))\n\n   
+3

使用XML解析器。 – 2015-04-17 09:28:36

+0

請閱讀http://stackoverflow.com/help/mcve – jonrsharpe

+0

'p1 = re.compile(「(。*?)」,re.DOTALL)' –

回答

1

如果你需要使用正則表達式,用這一個:

(?s)<parse>(.*?)</parse> 

demo

import re 
p = re.compile(ur'(?s)<parse>(.*?)</parse>') 
parse = re.findall(p, text) 
print parse 
+0

感謝您節省了一天的時間。 – UserP

+0

您隨時歡迎您! +1一個很好的問題。 –

0

使用BeautifulSoup解析器。

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(text) 
>>> for i in soup.select('parse'): 
     print(i.text) 


(ROOT 
     (S 
     (NP (NNP Stanford) (NNP University)) 
     (VP (VBZ is) 
     (ADJP (JJ located) 
     (PP (IN in) 
     (NP (NNP California))))) 
     (. .))) 


(ROOT 
     (S 
     (NP (PRP It)) 
     (VP (VBZ is) 
     (NP 
     (NP (DT a) (JJ great) (NN university)) 
     (, ,) 
     (VP (VBN founded) 
     (PP (IN in) 
     (NP (CD 1891)))))) 
     (. .))) 
+0

** + 1 **這絕對是正確的方法去。否則解析器仍然會讀取<! - 我們不再需要這個:(DO(BAD THINGS)) - >' – Eric

+0

你真的想'BeautifulSoup(text,「xml」)'雖然 – Eric

+0

@Eric我試過,但它不會識別第二個解析標籤。 –

相關問題