Matching multiple patterns in a multi-line string

I have some data that looks like this:

PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
      California, San Diego, La Jolla, California 92093, USA.

PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
      [email protected]

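I want to write a regex that matches the text following PMID, TI and AB.

Is it possible to get all of these with a single regex?

I have spent nearly the whole day trying to work out a regex, and the closest I could get is: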

import re

reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())

This returns matches only for the second "set" of the data, not for all of them.

Any ideas? Thanks!

Source

2009-09-01 e-Jah

How about:

import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())

Output:

{'pmid': '19587274', 'abstract': None, 'title': None} 
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'} 
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n  be continuously acquired, interpreted, and used to guide appropriate motor\n  responses. For example, when driving, a red \n', 'title': None} 
{'pmid': '19583148', 'abstract': None, 'title': None} 
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n  amyloidosis.\n'} 
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n  extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}

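Edit

As a verbose RE, to make it easier to understand (I think verbose REs should be used for anything but the simplest of expressions, but that's just my opinion!):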

#!/usr/bin/python
import re
reg4 = re.compile(r'''
    ^                     # Start of a line (due to re.MULTILINE, this may match at the start of any line)
    (?:                   # Non-capturing group with multiple options, first option:
        PMID-\s           # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
    |                     # Next option:
        TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)    # The title, a non-greedy match that will capture everything up to...
        ^PG               # The characters "PG" at the start of a line
    |                     # Next option:
        AB\s{2}-\s        # "AB", two spaces, a hyphen and a space
        (?P<abstract>.*?) # The abstract, a non-greedy match that will capture everything up to...
        ^AD               # "AD" at the start of a line
    )
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

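Note that you could replace ^PG and ^AD with ^\S to make the pattern more general (you want to match everything up to the first line that starts with a non-space character).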
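A minimal sketch of that generalisation, under a few assumptions of mine: the sample record is abbreviated, it uses the two-space "TI  - " / "AB  - " layout, and `^\S` is wrapped in a lookahead (with `\Z` as a fallback for a field that ends the input) so that the next field tag is not consumed. The name `generic` is only illustrative.

```python
import re

# Abbreviated sample record; assumes the two-space "TI  - " / "AB  - " layout.
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory evidence must\n"
    "      be continuously acquired, interpreted, and used to guide appropriate motor\n"
    "      responses.\n"
    "AD  - Perception and Cognition Laboratory\n"
)

generic = re.compile(r'''
    ^(?:
        PMID-\s (?P<pmid>[0-9]+)
      |
        TI\s{2}-\s (?P<title>.*?) (?=^\S|\Z)       # stop before the next line that starts with a non-space
      |
        AB\s{2}-\s (?P<abstract>.*?) (?=^\S|\Z)    # same terminator, whatever tag comes next
    )
''', re.MULTILINE | re.DOTALL | re.VERBOSE)

# Each match fills exactly one group, so merge them into one dict for this record.
record = {}
for m in generic.finditer(data):
    for name, value in m.groupdict().items():
        if value is not None:
            record[name] = value

print(record)
```

Because the terminator no longer names PG or AD, the pattern keeps working if the field that follows the title or abstract changes.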

Edit 2

If you want to catch the whole thing in one regex, get rid of the starting (?: and the ending ), and change the | characters to .*?:

#!/usr/bin/python
import re
reg4 = re.compile(r'''
    ^                     # Start of a line (due to re.MULTILINE, this may match at the start of any line)
    PMID-\s               # Literal "PMID-" followed by a space
    (?P<pmid>[0-9]+)      # Then a string of one or more digits, group as 'pmid'
    .*?                   # Skip as little as possible up to the next part:
    TI\s{2}-\s            # "TI", two spaces, a hyphen and a space
    (?P<title>.*?)        # The title, a non-greedy match that will capture everything up to...
    ^PG                   # The characters "PG" at the start of a line
    .*?                   # Skip as little as possible up to the next part:
    AB\s{2}-\s            # "AB", two spaces, a hyphen and a space
    (?P<abstract>.*?)     # The abstract, a non-greedy match that will capture everything up to...
    ^AD                   # "AD" at the start of a line
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

This gives:

{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n  be continuously acquired, interpreted, and used to guide appropriate motor\n  responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'} 
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n  extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n  amyloidosis.\n'}

Source

2009-09-01 09:14:15 DrAl

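Just to add: one problem with your original regex, e-Jah, is probably the greedy '.*' patterns. They match too much and "greedily eat" all of the later records, so you actually get the first entry's PMID matched with the last entry's abstract/title (all the other entries are swallowed by the first '.*' pattern). – Amber 2009-09-01 09:18:13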
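The effect described in this comment can be reproduced on a small scale. The following is only a sketch: the two records are shortened placeholders in the same layout as the question's data, and the patterns are simplified versions written for this illustration, not the asker's exact regex.

```python
import re

# Two shortened placeholder records in the same layout as the question's data.
data = (
    "PMID- 111\n"
    "TI  - First title\n"
    "PG  - 1-2\n"
    "\n"
    "PMID- 222\n"
    "TI  - Second title\n"
    "PG  - 3-4\n"
)

# Greedy: each ".*" grabs as much as it can, then backtracks from the end of the text.
greedy = re.compile(r"PMID- (?P<pmid>[0-9]+).*TI  - (?P<title>.*)\nPG", re.S)
# Non-greedy: each ".*?" stops at the first place where the rest of the pattern fits.
lazy = re.compile(r"PMID- (?P<pmid>[0-9]+).*?TI  - (?P<title>.*?)\nPG", re.S)

greedy_matches = [m.groupdict() for m in greedy.finditer(data)]
lazy_matches = [m.groupdict() for m in lazy.finditer(data)]

print(greedy_matches)  # one match: first PMID paired with the last title
print(lazy_matches)    # two matches, one per record
```

The greedy version produces a single match that spans both records, which is why the first PMID ends up next to the last title.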

The problem is the greedy quantifiers. Here is a regex that is more specific and non-greedy:

#!/usr/bin/python
import re
from pprint import pprint

with open("testdata.txt") as f:
    data = f.read()

reg4 = r'''
    ^PMID                 # Start matching at the string "PMID" at the start of a line
    \s*?-                 # As little whitespace as possible up to the next '-'
    \s*?                  # As little whitespace as possible
    (?P<pmid>[0-9]+)      # Capture the field "pmid", accepting only numeric characters
    .*?TI                 # Next, match any character up to the first occurrence of 'TI'
    \s*?-                 # As little whitespace as possible up to the next '-'
    \s*?                  # As little whitespace as possible
    (?P<title>.*?)PG      # Capture the field "title", accepting any character up to the next occurrence of 'PG'
    .*?AB                 # Match any character up to the following occurrence of 'AB'
    \s*?-                 # As little whitespace as possible up to the next '-'
    \s*?                  # As little whitespace as possible
    (?P<abstract>.*?)AD   # Capture the field "abstract", accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
    print(78 * "-")
    pprint(i.groupdict())

Output:

------------------------------------------------------------------------------ 
{'abstract': ' To successfully interact with objects in the environment, 
    sensory evidence must\n  be continuously acquired, interpreted, and 
    used to guide appropriate motor\n  responses. For example, when 
    driving, a red \n', 
'pmid': '19587274', 
'title': ' Domain general mechanisms of perceptual decision making in 
    human cortex.\n'} 
------------------------------------------------------------------------------ 
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different 
    diseases characterized by\n  extracellular accumulation of pathologic 
    fibrillar proteins in various tissues\n', 
'pmid': '19583148', 
'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients 
    with hepatic\n  amyloidosis.\n'}

You may want to strip the whitespace from each field after scanning.
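One way that clean-up might look, as a sketch: the `raw` dict below is a hand-written stand-in for a `groupdict()` result, and the helper name `clean` is hypothetical. Note that it goes a little further than a plain `strip()`, since it also joins the continuation lines into a single line.

```python
def clean(value):
    """Trim the ends and collapse newlines and runs of spaces into single spaces."""
    return " ".join(value.split())

# Hand-written stand-in for one groupdict() result with leading/trailing whitespace.
raw = {
    "pmid": "19583148",
    "title": " Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n"
             "      amyloidosis.\n",
}

stripped = {name: value.strip() for name, value in raw.items()}   # ends only
cleaned = {name: clean(value) for name, value in raw.items()}     # ends and inner runs

print(cleaned["title"])
```

Use the `strip()` variant if the original line breaks inside a field should be kept.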

Source

2009-09-01 09:22:03 exhuma

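Just one thing: this regex will have problems if the text "PG" appears in the title or "AD" appears in the abstract. Adding the '^' start-of-line qualifier will fix this. – DrAl 2009-09-01 09:27:42

Thanks @Al. Fixed. – exhuma 2009-09-01 10:14:16

Another regex: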

reg4 = r'(?<=PMID-)(?P<pmid>.*?)(?=OWN -).*?(?<=TI  -)(?P<title>.*?)(?=PG  -).*?(?<=AB  -)(?P<abstract>.*?)(?=AD  -)'

Source

2009-09-01 09:33:01

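How about not using regular expressions for this task at all, and instead using procedural code that splits on newlines and inspects the prefix codes with .startswith() and the like? The code would be longer, but everyone would be able to understand it without having to look things up in the help.

Source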

2009-09-01 10:02:20

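Having already answered this with a very long regex, I have to agree with Pēteris Caune: the '.startswith()' style of code may end up a bit messy, but compared with the complexity the regex requires, it will be better. It is also much easier to understand. You can probably also find some ready-made parsers online that will do the job for you... – DrAl 2009-09-01 10:19:18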
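A rough sketch of what such a line-by-line parser could look like. The function name `parse_records` and the sample text are made up for illustration, and it splits each line on its first hyphen instead of calling `.startswith()` for every tag, which is a variation on the suggestion rather than the exact approach described.

```python
def parse_records(text, wanted=("PMID", "TI", "AB")):
    """Yield one dict per blank-line-separated record, keeping only the `wanted` tags."""
    record, tag = {}, None
    for line in text.splitlines():
        if not line.strip():
            # Blank line: the current record (if any) is complete.
            if record:
                yield record
            record, tag = {}, None
        elif line[0].isspace():
            # Continuation line: append it to the field being read, if we keep that field.
            if tag in record:
                record[tag] += " " + line.strip()
        else:
            # New field: the tag is everything before the first hyphen.
            tag, _, value = line.partition("-")
            tag = tag.strip()
            if tag in wanted:
                record[tag] = value.strip()
    if record:
        yield record

# Shortened placeholder records in the same layout as the question's data.
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects,\n"
    "      sensory evidence must be acquired.\n"
    "AD  - Some department\n"
    "\n"
    "PMID- 19583148\n"
    "TI  - Ursodeoxycholic acid for treatment\n"
    "      of cholestasis.\n"
    "AB  - BACKGROUND: Amyloidosis\n"
    "AD  - Some hospital\n"
    "      second address line\n"
)

records = list(parse_records(data))
for rec in records:
    print(rec)
```

It is longer than the regex answers, but the field order does not matter and hyphens inside values such as "8675-87" are left alone, because only the first hyphen on a line is treated as the separator.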

If the order of the lines can change, you could use this pattern:

reg4 = re.compile(r""" 
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+) \n 
    | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)*) \n 
    | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)*) \n 
    | .+\n 
    )+ 
""", re.MULTILINE | re.VERBOSE)

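It will match consecutive non-empty lines and capture the PMID, TI and AB items.

Item values can span multiple lines, as long as the lines following the first one begin with a whitespace character.

"[^\S\n]" matches any whitespace character (\s) except a newline (\n).
".* (?:\n[^\S\n].*)*" matches a line together with any following lines that begin with a whitespace character.
".+\n" matches any other non-empty line.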
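To see the order independence in practice, here is a small sketch that runs the pattern above over two made-up records whose fields are deliberately shuffled. The sample text is a placeholder, and the example relies on Python's `re` keeping the last value captured by a group inside a repeated group.

```python
import re

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+) \n
    | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)*) \n
    | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)*) \n
    | .+\n
    )+
""", re.MULTILINE | re.VERBOSE)

# Placeholder records with the fields in a different order in each one.
data = (
    "AB  - Abstract first,\n"
    "      continued here.\n"
    "PMID- 111\n"
    "OWN - NLM\n"
    "TI  - A title\n"
    "\n"
    "TI  - Another title\n"
    "PMID- 222\n"
    "AB  - Another abstract\n"
)

# One match per block of consecutive non-empty lines, i.e. one per record.
results = [m.groupdict() for m in reg4.finditer(data)]
for rec in results:
    print(rec)
```

Every line, including the last one, has to end with a newline for this pattern to consume it, so it may be worth appending "\n" to the input before matching.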

Source

2009-09-01 10:23:10
