Regex / Python: matching the n occurrences that precede another match

I have an XML file structured like this:

<word id="15" pos="SS"> 
      <token>infarto</token> 
      <lemmas>infarto</lemmas> 
     </word> 
     <word id="16" pos="AS"> 
      <token>miocardico</token> 
      <lemmas>miocardico</lemmas> 
     </word> 
     <word id="17" pos="AS" annotated="head"> 
      <token>acuto</token> 
      <lemmas>acuto</lemmas> 
     </word> 
     <word id="18" pos="E"> 
      <token>in</token> 
      <lemmas>in</lemmas> 
     </word> 
     <word id="19" pos="SS"> 
      <token>corso</token> 
      <lemmas>corso</lemmas> 
     </word>

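What I am trying to do is get the "pos" and "token" values of the words surrounding the word with id 17 (the one with annotated="head").

This is no problem for the words that come after 17: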

(pos=")(.+)(")(\s\S+?)("head")([\s\S]+?)(>)(\w+?)(<+)([\S\s]+?)(pos=")(.+)(")([\s\S]+?) (token>)(.+)(<)([\s\S]+?)

This gives me all the information I want, and if I want to extend the window I can append another

(pos=")(.+)(")([\s\S]+?)(token>)(.+)(<)([\s\S]+?)

to the end. It isn't pretty, but it works.

Now, when I try to go in the other direction, I am absolutely stumped:

(pos=")(.+)(")([\s\S]+?)(token>)(.+)(<)([\s\S]+?)(pos=")(.+)(")(\s\S+?)("head")

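Instead of matching only the information for word 16 (the first one before the annotated head), it matches the information for all the preceding words (word 15, word 14, word 13, and so on).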
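For anyone who wants to reproduce this, here is a minimal sketch that assumes the input contains only words 15 to 17 from the sample above. The names `XML` and `BACKWARD` are made up for the illustration. The pattern is the one quoted above, unchanged. `re.search` reports the leftmost position at which the whole pattern can succeed, so the captured values come from the first `pos="` in the input rather than from word 16.

```python
import re

# Words 15-17 from the sample in the question (whitespace simplified).
XML = """\
<word id="15" pos="SS">
 <token>infarto</token>
 <lemmas>infarto</lemmas>
</word>
<word id="16" pos="AS">
 <token>miocardico</token>
 <lemmas>miocardico</lemmas>
</word>
<word id="17" pos="AS" annotated="head">
 <token>acuto</token>
 <lemmas>acuto</lemmas>
</word>
"""

# The "backward" pattern from the question, split over two lines only for width.
BACKWARD = re.compile(
    r'(pos=")(.+)(")([\s\S]+?)(token>)(.+)(<)([\s\S]+?)'
    r'(pos=")(.+)(")(\s\S+?)("head")'
)

m = BACKWARD.search(XML)
# Group 2 is the first captured pos value, group 6 the first captured token.
first_pos, first_token = m.group(2), m.group(6)
print(first_pos, first_token)  # SS infarto (word 15, not word 16)
```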

What am I missing?

P.S. Sadly, using an XML parser is not an option.

Source

2012-08-07 lhausmann

You should use an XML library for this kind of task, not regular expressions. – armandino 2012-08-07 09:21:05

You shouldn't use regular expressions on HTML or XML. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – 2012-08-07 09:22:17

http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – 2012-08-07 09:24:02

I think it should be something like this:

(?s)(<word(?:(?!<word).)*)<word[^>]*?annotated="head".*?(<word[^>](?:(?<!</word>).)*)

As a result, group #1 will contain the "word" node with id=16 and group #2 will contain the "word" node with id=18.
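A minimal sketch of how that pattern could be tried from Python, assuming the sample XML from the question sits in a string. The names `XML` and `NEIGHBOURS` are made up for the illustration. The negative lookahead `(?!<word)` stops group #1 from running across a second opening `<word` tag, which is what restricts it to the single node directly before the head.

```python
import re

# Words 15-19 from the sample in the question (whitespace simplified).
XML = """\
<word id="15" pos="SS">
 <token>infarto</token>
 <lemmas>infarto</lemmas>
</word>
<word id="16" pos="AS">
 <token>miocardico</token>
 <lemmas>miocardico</lemmas>
</word>
<word id="17" pos="AS" annotated="head">
 <token>acuto</token>
 <lemmas>acuto</lemmas>
</word>
<word id="18" pos="E">
 <token>in</token>
 <lemmas>in</lemmas>
</word>
<word id="19" pos="SS">
 <token>corso</token>
 <lemmas>corso</lemmas>
</word>
"""

# Same pattern as above, split over two lines only for width.
NEIGHBOURS = re.compile(
    r'(?s)(<word(?:(?!<word).)*)<word[^>]*?annotated="head".*?'
    r'(<word[^>](?:(?<!</word>).)*)'
)

m = NEIGHBOURS.search(XML)
before, after = m.group(1), m.group(2)  # node before the head, node after it
print(before)
print(after)
```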

Then you can parse each node separately with a regular expression such as the following:

(?s)<word[^>]*?pos="(?P<pos>[^"]+).*?<token>(?P<token>[^<]+)

and you will get two named groups, "pos" and "token".
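As a rough illustration, the per-node pattern could be applied like this. The `node` string is just word 16 from the question written out by hand, and the names `NODE`, `node` and `info` are made up for the example.

```python
import re

# Per-node pattern from above; named groups make the result self-describing.
NODE = re.compile(r'(?s)<word[^>]*?pos="(?P<pos>[^"]+).*?<token>(?P<token>[^<]+)')

# A single "word" node, e.g. one of the groups captured in the previous step.
node = (
    '<word id="16" pos="AS">\n'
    ' <token>miocardico</token>\n'
    ' <lemmas>miocardico</lemmas>\n'
    '</word>'
)

m = NODE.search(node)
info = m.groupdict()  # {'pos': 'AS', 'token': 'miocardico'}
print(info)
```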

Of course you could do it all with a single regular expression, but it would be very ugly.

Source

2012-08-07 18:04:11

Thanks! That worked like a charm! – lhausmann 2012-08-09 12:23:37

If you are sure your data is well-formed XML, I think it is possible. Try the following steps:

Step 1: <word[^>]*>([^<]*(?:(?!<\/?word)<[^<]*)*)<\/word> (ref http://regexr.com?31org)
Step 2: take the result of step 1 (group 1) and match that string against <token[^>]*>([^<]*(?:(?!<\/?token)<[^<]*)*)<\/token> (ref http://regexr.com?31ora) or <lemmas[^>]*>([^<]*(?:(?!<\/?lemmas)<[^<]*)*)<\/lemmas> (ref http://regexr.com?31ord)

You can try adapting these patterns to your requirements :)
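The two steps could be wired together roughly as follows. This is only a sketch: the step 1 and step 2 patterns are copied from above, but the way the head node is located (a plain substring test for annotated="head") and the names `XML`, `WORD`, `TOKEN`, `head` and `tokens` are assumptions added for the illustration.

```python
import re

# Words 15-19 from the sample in the question (whitespace simplified).
XML = """\
<word id="15" pos="SS">
 <token>infarto</token>
 <lemmas>infarto</lemmas>
</word>
<word id="16" pos="AS">
 <token>miocardico</token>
 <lemmas>miocardico</lemmas>
</word>
<word id="17" pos="AS" annotated="head">
 <token>acuto</token>
 <lemmas>acuto</lemmas>
</word>
<word id="18" pos="E">
 <token>in</token>
 <lemmas>in</lemmas>
</word>
<word id="19" pos="SS">
 <token>corso</token>
 <lemmas>corso</lemmas>
</word>
"""

# Step 1: one match per <word> node; group 1 is the node's inner content.
WORD = re.compile(r'<word[^>]*>([^<]*(?:(?!<\/?word)<[^<]*)*)<\/word>')
# Step 2: pull the <token> text out of a node's inner content.
TOKEN = re.compile(r'<token[^>]*>([^<]*(?:(?!<\/?token)<[^<]*)*)<\/token>')

words = list(WORD.finditer(XML))

# Assumption: the head is the one node whose text contains annotated="head".
head = next(i for i, w in enumerate(words) if 'annotated="head"' in w.group(0))

# Take the nodes directly before and after the head, skipping missing ones.
neighbours = [words[i] for i in (head - 1, head + 1) if 0 <= i < len(words)]
tokens = [TOKEN.search(w.group(1)).group(1) for w in neighbours]
print(tokens)  # ['miocardico', 'in']
```

Because step 1 yields the nodes as a list, widening the window to n words on either side is a matter of changing the indices rather than the pattern.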

Reference: Mastering Regular Expressions, 3rd edition

Source

2012-08-07 12:25:02 godspeedlee
