2009-09-01

Matching multiple patterns in a multiline string

I have some data that looks like this:

PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.

PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      [email protected]

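I want to write a regex that matches the text following PMID, TI and AB.

Is it possible to get all of these with a single regex?

I have spent nearly a whole day trying to work out a regex, and the closest I could get is: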

reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())

This returns a match only for the second "set" of data, not for all of them.

Any ideas? Thanks!

Answers


How about:

import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())

Output:

{'pmid': '19587274', 'abstract': None, 'title': None} 
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'} 
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n  be continuously acquired, interpreted, and used to guide appropriate motor\n  responses. For example, when driving, a red \n', 'title': None} 
{'pmid': '19583148', 'abstract': None, 'title': None} 
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n  amyloidosis.\n'} 
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n  extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None} 

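Edit

As a verbose RE, to make it easier to understand (I think verbose REs should be used for anything but the simplest of expressions, but that is just my opinion!):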

#!/usr/bin/python
import re
reg4 = re.compile(r'''
    ^                       # Start of a line (due to re.MULTILINE, this may match at the start of any line)
    (?:                     # Non-capturing group with multiple options, first option:
        PMID-\s             # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)    # Then a string of one or more digits, group as 'pmid'
    |                       # Next option:
        TI\s{2}-\s          # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)      # The title, a non-greedy match that will capture everything up to...
        ^PG                 # The characters "PG" at the start of a line
    |                       # Next option:
        AB\s{2}-\s          # "AB", two spaces, a hyphen and a space
        (?P<abstract>.*?)   # The abstract, a non-greedy match that will capture everything up to...
        ^AD                 # "AD" at the start of a line
    )
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

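Note that you can replace ^PG and ^AD with ^\S to make it more general (you want to match everything up to the first line that starts with a non-space character).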
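A minimal sketch of that generalization follows. It assumes the MEDLINE-style layout in which TI and AB are followed by two spaces, a hyphen and a space, and it uses an abbreviated copy of the sample records. Instead of consuming ^\S it checks for it with a lookahead, which is my own variation so that the first character of the next field is not swallowed by the match.

```python
import re

# Abbreviated copy of the sample records (field padding is an assumption).
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory evidence must\n"
    "      be continuously acquired, interpreted, and used to guide appropriate motor\n"
    "      responses.\n"
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of\n"
    "      California, San Diego.\n"
    "\n"
    "PMID- 19583148\n"
    "OWN - NLM\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "PG  - 482-6\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.\n"
)

generic = re.compile(r'''
    ^
    (?:
        PMID-\s (?P<pmid>[0-9]+)
    |
        TI\s{2}-\s (?P<title>.*?) (?=^\S)        # stop before the next line that starts with a non-space
    |
        AB\s{2}-\s (?P<abstract>.*?) (?=^\S)     # same stop condition, whatever field comes next
    )
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)

# Flatten the matches into (field name, value) pairs, skipping unmatched groups.
fields = [
    (name, value)
    for match in generic.finditer(data)
    for name, value in match.groupdict().items()
    if value is not None
]
for name, value in fields:
    print(name, repr(value))
```

With this form the pattern no longer depends on PG following TI or AD following AB, only on continuation lines being indented.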

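Edit 2

If you want to capture the whole thing in one regular expression, get rid of the opening (?: and the closing ), and change the | characters to .*?: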

#!/usr/bin/python
import re
reg4 = re.compile(r'''
    ^                       # Start of a line (due to re.MULTILINE, this may match at the start of any line)
    PMID-\s                 # Literal "PMID-" followed by a space
    (?P<pmid>[0-9]+)        # Then a string of one or more digits, group as 'pmid'
    .*?                     # Skip as little as possible up to the next part:
    TI\s{2}-\s              # "TI", two spaces, a hyphen and a space
    (?P<title>.*?)          # The title, a non-greedy match that will capture everything up to...
    ^PG                     # The characters "PG" at the start of a line
    .*?                     # Skip as little as possible up to the next part:
    AB\s{2}-\s              # "AB", two spaces, a hyphen and a space
    (?P<abstract>.*?)       # The abstract, a non-greedy match that will capture everything up to...
    ^AD                     # "AD" at the start of a line
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

This gives:

{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n  be continuously acquired, interpreted, and used to guide appropriate motor\n  responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'} 
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n  extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n  amyloidosis.\n'} 

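Just to add: one problem with your original regex is probably that it has too many greedy '.*' patterns, e-Jah. It matches too much and "greedily eats" everything up to the last record as part of the greedy match, so you actually get the PMID of the first entry paired with the abstract and title of the last entry (and all the other entries are swallowed by the first '.*' pattern of that single match). – Amber 2009-09-01 09:18:13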
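To make the greedy-versus-lazy difference concrete, here is a small sketch that runs the regex from the question next to a variant in which every quantifier is made non-greedy. The sample text is an abbreviated copy of the records, and the two-space padding after the field tags is an assumption about the original layout.

```python
import re

# Abbreviated copy of the sample records (field padding is an assumption).
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment.\n"
    "AD  - Perception and Cognition Laboratory, Department of Psychology.\n"
    "\n"
    "PMID- 19583148\n"
    "OWN - NLM\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "PG  - 482-6\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.\n"
)

# The pattern from the question: every .* is greedy.
greedy = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
# The same pattern with non-greedy quantifiers.
lazy = r'PMID- (?P<pmid>[0-9]+).*?TI.*?- (?P<title>.*?)PG.*?AB.*?- (?P<abstract>.*?)AD'

greedy_matches = [m.groupdict() for m in re.finditer(greedy, data, re.S | re.M)]
lazy_matches = [m.groupdict() for m in re.finditer(lazy, data, re.S | re.M)]

# One greedy match: the first PMID glued to the last title and abstract.
print(len(greedy_matches), greedy_matches[0]['pmid'], greedy_matches[0]['title'][:15])
# One lazy match per record.
print(len(lazy_matches), [m['pmid'] for m in lazy_matches])
```

The lazy variant still stops at any "PG" or "AD" inside the text, so it is only meant to illustrate the quantifier behaviour, not to replace the anchored patterns.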


The problem is the greedy quantifiers. Here is a regex that is more specific and non-greedy:

#!/usr/bin/python
import re
from pprint import pprint

with open("testdata.txt") as f:
    data = f.read()

reg4 = r'''
    ^PMID                   # Start matching at the string "PMID" at the start of a line
    \s*?-                   # As little whitespace as possible up to the next '-'
    \s*?                    # As little whitespace as possible
    (?P<pmid>[0-9]+)        # Capture the field "pmid", accepting only numeric characters
    .*?TI                   # Next, match any character up to the first occurrence of 'TI'
    \s*?-                   # As little whitespace as possible up to the next '-'
    \s*?                    # As little whitespace as possible
    (?P<title>.*?)^PG       # Capture the field "title", accepting any character up to the next 'PG' at the start of a line
    .*?AB                   # Match any character up to the following occurrence of 'AB'
    \s*?-                   # As little whitespace as possible up to the next '-'
    \s*?                    # As little whitespace as possible
    (?P<abstract>.*?)^AD    # Capture the field "abstract", accepting any character up to the next 'AD' at the start of a line
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
    print(78 * "-")
    pprint(i.groupdict())

Output:

------------------------------------------------------------------------------ 
{'abstract': ' To successfully interact with objects in the environment, 
    sensory evidence must\n  be continuously acquired, interpreted, and 
    used to guide appropriate motor\n  responses. For example, when 
    driving, a red \n', 
'pmid': '19587274', 
'title': ' Domain general mechanisms of perceptual decision making in 
    human cortex.\n'} 
------------------------------------------------------------------------------ 
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different 
    diseases characterized by\n  extracellular accumulation of pathologic 
    fibrillar proteins in various tissues\n', 
'pmid': '19583148', 
'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients 
    with hepatic\n  amyloidosis.\n'} 

You may want to strip the whitespace from each field after scanning.
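One possible way to do that clean-up is sketched below. The helper name clean is hypothetical, and the raw dictionary simply mimics the shape of the groupdict() output shown above. Besides stripping the ends, it also collapses the newline and indentation of wrapped lines into a single space, which goes slightly beyond a plain strip().

```python
def clean(value):
    """Strip a captured field and join its wrapped lines (hypothetical helper)."""
    # str.split() with no argument splits on any run of whitespace,
    # so leading/trailing blanks and line-wrap indentation all disappear.
    return " ".join(value.split())


# Shaped like one groupdict() result from the regex above.
raw = {
    'pmid': '19587274',
    'title': ' Domain general mechanisms of perceptual decision making in human cortex.\n',
    'abstract': ' To successfully interact with objects in the environment, sensory evidence must\n'
                '      be continuously acquired, interpreted, and used to guide appropriate motor\n'
                '      responses. For example, when driving, a red \n',
}

cleaned = {name: clean(value) for name, value in raw.items()}
for name, value in cleaned.items():
    print(name, repr(value))
```

If the original line breaks matter, calling value.strip() on each field keeps them and only trims the ends.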

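Just one thing: this regex will run into problems if the text "PG" appears in the title or "AD" appears in the abstract. Adding the '^' start-of-line anchor will fix this. – DrAl 2009-09-01 09:27:42

Thanks @Al. Fixed. – exhuma 2009-09-01 10:14:16

Another regex: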

reg4 = r'(?<=PMID-)(?P<pmid>.*?)(?=OWN -).*?(?<=TI  -)(?P<title>.*?)(?=PG  -).*?(?<=AB  -)(?P<abstract>.*?)(?=AD  -)'

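How about not using regular expressions for this task at all, and instead using procedural code that splits the text on newlines and looks at the prefix codes with .startswith() and the like? The code would be longer, but everyone could understand it without having to look things up in the help.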


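Having already answered this question with a very long regex, I have to agree with Pēteris Caune: the '.startswith()' style of code may end up a little messy, but compared with the complexity of the regex required, it would be better. It is also easy to understand. You could also find some ready-made parsers online to do this job for you... – DrAl 2009-09-01 10:19:18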
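A rough sketch of what such line-by-line code could look like is given below. The function name parse_records and the PREFIXES table are hypothetical, the prefixes assume two spaces after TI and AB, and continuation lines are assumed to start with whitespace. Wrapped lines are joined with a single space.

```python
# Field prefixes to collect, mapped to output names (layout is an assumption).
PREFIXES = {"PMID-": "pmid", "TI  -": "title", "AB  -": "abstract"}


def parse_records(text):
    """Collect the PMID, TI and AB fields of each record (hypothetical helper)."""
    records = []
    current = None  # name of the wanted field that continuation lines extend
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current field
            continue
        if line[0].isspace():
            # Continuation line: append it only if we are inside a wanted field.
            if current is not None:
                records[-1][current] += " " + line.strip()
            continue
        if line.startswith("PMID-"):
            records.append({})  # every PMID line starts a new record
        current = None
        for prefix, name in PREFIXES.items():
            if records and line.startswith(prefix):
                records[-1][name] = line[len(prefix):].strip()
                current = name
                break
    return records


data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory evidence must\n"
    "      be continuously acquired, interpreted, and used to guide appropriate motor\n"
    "      responses.\n"
    "AD  - Perception and Cognition Laboratory, Department of Psychology, University of\n"
    "      California, San Diego.\n"
    "\n"
    "PMID- 19583148\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
)

records = parse_records(data)
for record in records:
    print(record)
```

It is longer than any of the regexes, but adding another field is a one-line change to the prefix table.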


If the order of the lines can change, you could use this pattern:

reg4 = re.compile(r""" 
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+) \n 
    | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)*) \n 
    | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)*) \n 
    | .+\n 
    )+ 
""", re.MULTILINE | re.VERBOSE) 

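It matches a run of consecutive non-empty lines and captures the PMID, TI and AB items.

Item values can span multiple lines, as long as the lines after the first one start with a whitespace character.

  • "[^\S\n]" matches any whitespace character (\s) except a newline (\n).
  • ".* (?:\n[^\S\n].*)*" matches the rest of the line plus any following consecutive lines that start with a whitespace character.
  • ".+\n" matches any other non-empty line.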
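As a quick check of the pattern above, here is a sketch that applies it with finditer. The sample is an abbreviated copy of the records in which the second record has been deliberately reordered (AB before TI) for illustration, and records are assumed to be separated by a blank line.

```python
import re

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+) \n
    | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)*) \n
    | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)*) \n
    | .+\n
    )+
""", re.MULTILINE | re.VERBOSE)

# Abbreviated sample; the second record is reordered on purpose.
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory evidence must\n"
    "      be continuously acquired.\n"
    "AD  - Perception and Cognition Laboratory, Department of Psychology.\n"
    "\n"
    "PMID- 19583148\n"
    "AB  - BACKGROUND: Amyloidosis represents a group of different diseases.\n"
    "TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
    "      amyloidosis.\n"
    "AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.\n"
)

# Each match spans one block of non-empty lines, i.e. one record.
results = [match.groupdict() for match in reg4.finditer(data)]
for result in results:
    print(result)
```

Because the named groups sit inside a repeated group, each one keeps the value from the last line that matched it, so a record with two TI lines would report only the second one.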