2017-07-03
Using a regular expression to extract all parts of a text string between a start match and an end match

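I have already posted a similar question about extracting text in Python with regular expressions, but I have a different problem involving the non-greedy quantifier, so I am asking again with a different example. I need to use a regular expression in Python to extract every part of a text string that lies between two specific matches. Here is an example text: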

example = """ 
    The Bank does offer a hybrid loan. Hybrid loans are loans that start as a 
    fixed rate mortgage but after a set number of years automatically adjust 
    to an adjustable rate mortgage. The Bank offers a three year fixed rate mortgage 
    after which the interest rate will adjust annually. Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15 Item 2. Properties 15-16 
    The forward-looking statements are made as of the date of this report, 
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio. 
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area. 
    """ 

I want to extract the parts of the text between a start match of "Item 1." and an end match of "Item 2.", so the final results should look like this:

final_result_1 = """ 
    ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897. 
    """ 

final_result_2 = """ 
    Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15 
    """ 

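The final results should be ordered by text length, so 'final_result_1' is the longer of the two parts and 'final_result_2' is the shorter one. You can refer to the answers to my previous question. Thanks in advance!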

I would love to help, but this question is very confusing. Could you create a short sample text and explain what output you want? –

@krcoder, you need to exclude "ITEM 2" from the extracted text, right? –

@code_byter, that is right, and 'Item 2' is excluded from 'final_result_2' as well. – krcoder

Answers

I believe you need something like this:

import re
example = """ 
    The forward-looking statements are made as of the date of this report, 
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio. 
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area. 
""" 
# Note: [\s\S]* is greedy, so this captures from the first "ITEM 1" up to the last "ITEM 2" in the text.
matches = re.findall(r'(ITEM 1[\s\S]*)ITEM 2', example, re.IGNORECASE)
# matches is a list of every captured string; sort it by string length.
matches.sort(key=len, reverse=True)
# matches is now ordered from longest to shortest.

Edit (what the OP wanted but could not get working):

import re
example = """ 
    The forward-looking statements are made as of the date of this report, 
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio. 
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area. 
""" 
# Raw string avoids invalid-escape warnings; the quantifier here is still greedy.
pat = re.compile(r'(ITEM 1[\s\S]*)ITEM 2', re.IGNORECASE)
matches = pat.findall(example)
print(matches)
# matches is a list of every captured string; sort it by string length.
matches.sort(key=len, reverse=True)
# matches is now ordered from longest to shortest.
print(matches)

Code tested.

Final edit:

import re
example = """ 
    The Bank does offer a hybrid loan. Hybrid loans are loans that start as a 
    fixed rate mortgage but after a set number of years automatically adjust 
    to an adjustable rate mortgage. The Bank offers a three year fixed rate mortgage 
    after which the interest rate will adjust annually. Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15 Item 2. Properties 15-16 
    The forward-looking statements are made as of the date of this report, 
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio. 
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area. 
""" 
# [\s\S]*? is non-greedy: each match stops at the nearest following "ITEM 2".
pat = re.compile(r'(ITEM 1[\s\S]*?)ITEM 2', re.IGNORECASE)
matches = pat.findall(example)
print(matches)
# matches is a list of every captured string; sort it by string length.
matches.sort(key=len, reverse=True)
# matches is now ordered from longest to shortest.

# To check that it works:
for match in matches:
    print(match)
    print('\n')

Why don't you try it now? :)

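Thank you for the answer, but this is not what I expected. As I mentioned above, the specific issue with the example text is that non-greedy matching is needed, because there are multiple start ('item 1') and end ('item 2') matches throughout the text. – krcoder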
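To illustrate the difference this comment describes, here is a minimal sketch comparing the greedy and non-greedy forms of the pattern. The short `text` string is a made-up stand-in for the example above, not the original data; only the two quantifiers differ.

```python
import re

# Hypothetical shortened text with two start/end pairs (an assumption for illustration).
text = "intro Item 1. A Item 2. B ITEM 1. CCCC ITEM 2. D"

# Greedy: [\s\S]* runs to the LAST "ITEM 2", so both sections collapse into one match.
greedy = re.findall(r"(ITEM 1[\s\S]*)ITEM 2", text, re.IGNORECASE)

# Non-greedy: [\s\S]*? stops at the NEAREST "ITEM 2", giving one match per pair.
lazy = re.findall(r"(ITEM 1[\s\S]*?)ITEM 2", text, re.IGNORECASE)

# Order the non-greedy matches from longest to shortest, as the question asks.
by_length = sorted(lazy, key=len, reverse=True)

print(len(greedy))  # 1
print(lazy)         # ['Item 1. A ', 'ITEM 1. CCCC ']
print(by_length)    # ['ITEM 1. CCCC ', 'Item 1. A ']
```

With several start/end pairs in one text, the greedy version can only ever return the outermost span, which is why the lazy quantifier matters here.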

I see. Let me see what I can do. –

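Specifically, in the example text above, the first start match, 'Item 1. Business 3-13 ...', is on the fourth line, and the first end match, 'Item 2. Properties 15-16 ...', is on the fifth line. – krcoder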
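One way to check where each start and end match falls, as this comment does by hand, is to iterate with `re.finditer` and count newlines before each position. This is only a sketch: the `text` below is a shortened, hypothetical version of the example, and escaping the dot (`ITEM 1\.`) is my own addition so that "Item 1a." is not treated as a start marker.

```python
import re

# Hypothetical five-line stand-in for the example text (an assumption for illustration).
text = (
    "line one\n"
    "intro Item 1. Business 3-13 Item 1a.\n"
    "Risk Factors 15 Item 2. Properties\n"
    "PART 1. ITEM 1. BUSINESS\n"
    "body text.ITEM 2. PROPERTIES\n"
)

# Non-greedy match from each "Item 1." to the nearest following "Item 2.".
pattern = re.compile(r"(ITEM 1\.[\s\S]*?)ITEM 2\.", re.IGNORECASE)

spans = []
for m in pattern.finditer(text):
    # 1-based line numbers of the start marker and of the end marker.
    start_line = text.count("\n", 0, m.start(1)) + 1
    end_line = text.count("\n", 0, m.end(1)) + 1
    spans.append((start_line, end_line, m.group(1).strip()))

for start_line, end_line, section in spans:
    print(start_line, end_line, repr(section))
```

In this made-up text the first pair sits on lines 2 and 3 and the second on lines 4 and 5; the same loop run on the real example should report the line numbers mentioned in the comment.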