2014-09-23 131 views
re_newspeaker =   r'^(<bullet> | )(?P<name>(%s|(((Mr)|(Ms)|(Mrs))\. [-A-Za-z \']+(of [A-Z][a-z]+)?))|((The ((VICE|ACTING|Acting))?(PRESIDENT|SPEAKER|CHAIR(MAN)?)(pro tempore)?)|(The PRESIDING OFFICER)|(The CLERK)|(The CHIEF JUSTICE)|(The VICE PRESIDENT)|(Mr\. Counsel [A-Z]+))(\([A-Za-z.\'\- ]+\))?)\.' 


re_speaking =   r'^(<bullet> | )((((((Mr)|(Ms)|(Mrs))\. [A-Za-z \'\-]+(of [A-Z][a-z]+)?)|((The (VICE |Acting |ACTING)?(PRESIDENT|SPEAKER)(pro tempore)?)|(The PRESIDING OFFICER)|(The CLERK))(\([A-Za-z.\'\- ]+\))?))\.)?(?P<start>.)' 

For some reason, the Python regular expressions above do not capture names that contain an apostrophe.

For example, Mr. D'STALL is not matched. Any help with the regex pattern would be greatly appreciated.

The code takes the input and marks it up as XML. The tagged output is shown first, followed by the raw input:

<speaker>Mr. D'STALL</speaker><speaking>Mr. President, I have been seeking to obtain a report on 
this bill. I am not on the Budget Committee, and I am not on the 
Government Relations Committee. But from what I understand, this is a 
very important bill, a big bill, a complex bill, far reaching in its 
contents. I have been queried, along with all other Senators, I 
suppose, as to whether or not they would have any objection to the 
adoption of the committee amendments, en bloc. I am going to object to 
the adoption of the committee amendments, en bloc, until I see the 
committee report.</speaking> 

    Mr. D'STALL. Mr. President, I have been seeking to obtain a report on 
this bill. I am not on the Budget Committee, and I am not on the 
Government Relations Committee. But from what I understand, this is a 
very important bill, a big bill, a complex bill, far reaching in its 
contents. I have been queried, along with all other Senators, I 
suppose, as to whether or not they would have any objection to the 
adoption of the committee amendments, en bloc. I am going to object to 
the adoption of the committee amendments, en bloc, until I see the 
committee report. 

The regular expression does not match the paragraph above.
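
For readers who want to reproduce the tagging step, here is a rough sketch of how a paragraph might be wrapped in `<speaker>` and `<speaking>` tags. It is not the asker's actual code: `SPEAKER_LINE` and `tag_paragraph` are hypothetical names, the pattern is a heavily reduced stand-in for the ones above, and allowing any amount of leading whitespace is an assumption.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical, simplified speaker pattern. Unlike the patterns in the
# question, it tolerates any amount of leading whitespace (an assumption).
SPEAKER_LINE = re.compile(
    r"^\s*(?P<name>(?:Mr|Ms|Mrs)\. [-A-Z']+)\. (?P<speech>.*)",
    re.DOTALL,
)


def tag_paragraph(paragraph):
    """Wrap a 'Mr. NAME. text' paragraph in speaker/speaking tags."""
    match = SPEAKER_LINE.match(paragraph)
    if match is None:
        return None  # not the start of a new speaker's remarks
    # escape() handles &, < and >; apostrophes are left untouched.
    name = escape(match.group("name"))
    speech = escape(match.group("speech").strip())
    return "<speaker>%s</speaker><speaking>%s</speaking>" % (name, speech)


raw = "    Mr. D'STALL. Mr. President, I am going to object\nuntil I see the committee report. "
print(tag_paragraph(raw))
```

The apostrophe only needs to appear inside the character class (`[-A-Z']`); it has no special meaning in a Python regular expression.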

What a horribly unmaintainable pattern you have there. I assume the problem affects both patterns? – 2014-09-23 08:47:42

http://regex101.com/r/dT6dN8/1 – 2014-09-23 08:47:58

Your regex requires a 'space' or 'bullet' at the start; is that present in your input? – vks 2014-09-23 08:49:51
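
The point about the required prefix can be checked with a stripped-down pattern. The sketch below is only an illustration under assumptions: `simple_speaker` is a hypothetical name, and the pattern keeps just the `^(<bullet> | )` prefix and the `Mr./Ms./Mrs.` branch from the question rather than the full expression.

```python
import re

# Simplified stand-in for the question's pattern (hypothetical name):
# only the line prefix and the Mr./Ms./Mrs. branch are kept.
simple_speaker = re.compile(
    r"^(<bullet> | )(?P<name>(Mr|Ms|Mrs)\. [-A-Za-z ']+)\."
)

one_space = " Mr. D'STALL. Mr. President, I have been seeking..."
four_spaces = "    Mr. D'STALL. Mr. President, I have been seeking..."

# Exactly one leading space: the prefix matches and the name is captured,
# apostrophe included.
m = simple_speaker.match(one_space)
print(m.group("name") if m else None)

# Four leading spaces: the prefix consumes one space, then "Mr" is expected
# but another space follows, so there is no match at all.
print(simple_speaker.match(four_spaces))
```

If the real input lines are indented by more than one space, the apostrophe may not be the cause of the failed match.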

Answers

re_newspeaker =   r'^(<bullet> | )(?P<name>(%s|(((Mr)|(Ms)|(Mrs))\. [-A-Z\']+|((Miss) [-A-Z\']+)(of [A-Z][a-z]+)?))|((The ((VICE|ACTING|Acting))?(PRESIDENT|SPEAKER|CHAIR(MAN)?)(pro tempore)?)|(The PRESIDING OFFICER)|(The CLERK)|(The CHIEF JUSTICE)|(The VICE PRESIDENT)|(Mr\. Counsel [A-Z]+))(\([A-Za-z.\- ]+\))?)\.' 

re_speaking =   r'^(<bullet> | )((((((Mr)|(Ms)|(Mrs))\. [A-Z\']+|((Miss) [-A-Z\']+)(of [A-Z][a-z]+)?)|((The (VICE |Acting |ACTING)?(PRESIDENT|SPEAKER)(pro tempore)?)|(The PRESIDING OFFICER)|(The CLERK))(\([A-Za-z.\- ]+\))?))\.)?(?P<start>.)' 

The regex above solved my problem. I thought I would post it in case anyone else runs into the same issue!
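
Since long single-line patterns like these are hard to maintain, here is a minimal sketch of how a reduced version could be laid out with `re.VERBOSE`. It covers only a few of the branches and is not a drop-in replacement for the patterns above; `SPEAKER` and `speaker_name` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical, reduced version of the speaker pattern, laid out with
# re.VERBOSE. In verbose mode unescaped whitespace in the pattern is
# ignored, so literal spaces are written as "[ ]".
SPEAKER = re.compile(
    r"""
    ^(?:<bullet>[ ]|[ ])            # required prefix: "<bullet> " or one space
    (?P<name>
        (?:Mr|Ms|Mrs)\.[ ][-A-Z']+  # e.g. "Mr. D'STALL" (upper-case surname)
        (?:[ ]of[ ][A-Z][a-z]+)?    # optional " of <State>" suffix
      | The[ ]PRESIDING[ ]OFFICER
      | The[ ]CLERK
    )
    \.                              # period that ends the speaker label
    """,
    re.VERBOSE,
)


def speaker_name(line):
    """Return the speaker label at the start of line, or None."""
    match = SPEAKER.match(line)
    return match.group("name") if match else None


print(speaker_name(" Mr. D'STALL. Mr. President, I have been seeking..."))
print(speaker_name(" The PRESIDING OFFICER. Without objection."))
print(speaker_name(" I have been queried, along with all other Senators."))
```

Each alternative sits on its own line with a comment, so adding a new title or widening the surname character class is a one-line change.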