2015-12-02 60 views
1

PHP sentence boundaries including empty lines?

This is an extension of the PHP sentence boundaries question on SO.

I would like to know how to change the regex so that line breaks are preserved as well.

The sample code splits some text into sentences, removes one sentence, and then puts the text back together:

<?php 
$re = '/# Split sentences on whitespace between them. 
    (?<=    # Begin positive lookbehind. 
     [.!?]    # Either an end of sentence punct, 
    | [.!?][\'"]  # or end of sentence punct and quote. 
    )     # End positive lookbehind. 
    (?<!    # Begin negative lookbehind. 
     Mr\.    # Skip either "Mr." 
    | Mrs\.    # or "Mrs.", 
    | Ms\.    # or "Ms.", 
    | Jr\.    # or "Jr.", 
    | Dr\.    # or "Dr.", 
    | Prof\.   # or "Prof.", 
    | Sr\.    # or "Sr.", 
    | T\.V\.A\.   # or "T.V.A.", 
         # or... (you get the idea). 
    )     # End negative lookbehind. 
    [\s+|^$]   # Split on whitespace between sentences/empty lines. 
    /ix'; 

$text = <<<EOL
This is paragraph one. This is sentence one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!
EOL;

echo "\nBefore: \n" . $text . "\n"; 

$sentences = preg_split($re, $text, -1); 

$sentences[1] = " "; // remove 'sentence one' 

// put text back together 
$text = implode($sentences); 

echo "\nAfter: \n" . $text . "\n"; 
?> 

Running this, the output is:

Before: 
This is paragraph one. This is sentence one. Sentence two! 

This is paragraph two. This is sentence three. Sentence four! 

After: 
This is paragraph one. Sentence two! 
This is paragraph two. This is sentence three. Sentence four! 

I am trying to make the 'After' text identical to the 'Before' text, with just the one sentence removed:

After: 
This is paragraph one. Sentence two! 

This is paragraph two. This is sentence three. Sentence four! 

I was hoping this could be done with a tweak to the regex, but what am I missing?

+1

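There seems to be a problem with this regex: '[\s+|^$]' actually matches whitespace and the literal '+', '|', '^' and '$' symbols. Replace it with '(?:\h+|^$)' and I think that's it. –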

+0

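I think you can just remove the '+' after the '\s', or use '\s{1}' if you really need it to match exactly one, because '\s+' is consuming the other whitespace. Essentially you need 'array("stuf", "\n", "stuff");' but I'm not sure without testing it, and this is too complicated to run in my head. – ArtisticPhoenix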

Answers

1

The end of the pattern should be replaced with:

(?:\h+|^$)   # Split on whitespace between sentences\/empty lines. 
/mix'; 

See the IDEONE demo.

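Note that [\s+|^$] matches whitespace (both horizontal and vertical, such as newlines) as well as the literal +, |, ^ and $ symbols, because it is a character class.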

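Instead of a character class, a group (preferably a non-capturing one here) is needed. Inside a group (written as (...)), | works as the alternation operator.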
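The difference between the character class and the group can be checked directly. This is a minimal sketch; the variable names are illustrative and not taken from the original post.

```php
<?php
// Inside [...] the characters + | ^ $ are ordinary literals.
$class = '/[\s+|^$]/';
// Inside (?:...) the | is alternation and ^ $ act as anchors.
$group = '/(?:\s+|^$)/';

$classHitsPlus  = preg_match($class, '+'); // the class treats + as a literal
$groupHitsPlus  = preg_match($group, '+'); // no whitespace and no empty string here
$classHitsEmpty = preg_match($class, '');  // a class always needs one character
$groupHitsEmpty = preg_match($group, '');  // ^$ matches the empty string

var_dump($classHitsPlus, $groupHitsPlus, $classHitsEmpty, $groupHitsEmpty);
// prints int(1), int(0), int(0), int(1), one per line
```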

Instead of \s, I suggest using \h, which matches horizontal whitespace only (no line breaks).
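A quick way to see the \s versus \h distinction, as a small sketch with illustrative variable names:

```php
<?php
$sMatchesNewline = preg_match('/\s/', "\n"); // \s includes line breaks
$hMatchesNewline = preg_match('/\h/', "\n"); // \h does not
$hMatchesSpace   = preg_match('/\h/', ' ');  // \h matches a space
$hMatchesTab     = preg_match('/\h/', "\t"); // \h matches a tab

var_dump($sMatchesNewline, $hMatchesNewline, $hMatchesSpace, $hMatchesTab);
// prints int(1), int(0), int(1), int(1), one per line
```

Because \h never consumes "\n", the blank line between paragraphs stays in the split pieces and survives the implode.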

Without the /m multiline modifier, ^$ will only match an empty string. So I have added the /m modifier to the options.
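The effect of /m on ^$ can be illustrated by counting matches in a string that contains one empty line. The sample string below is made up for this sketch.

```php
<?php
$sample = "One.\n\nTwo.";

// Without /m, ^ only matches at the very start of the subject,
// so ^$ finds nothing in a non-empty string.
$withoutM = preg_match_all('/^$/', $sample);

// With /m, ^ and $ also match around line breaks,
// so the empty line between the two paragraphs is found.
$withM = preg_match_all('/^$/m', $sample);

var_dump($withoutM, $withM);
// prints int(0) then int(1)
```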

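Also note that I had to escape the / inside the last comment, otherwise there is a warning that the regex is not correct. Alternatively, use a different regex delimiter.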
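As a sketch of the alternative, here is a shortened version of the pattern using ~ as the delimiter, so the / in the comment needs no escaping. The abbreviation lookbehind is left out for brevity, and the sample text is an assumption for illustration.

```php
<?php
$re = '~
    (?<=[.!?])   # Preceded by end-of-sentence punctuation.
    (?:\h+|^$)   # Split on whitespace between sentences/empty lines.
    ~mx';

$parts = preg_split($re, "One. Two!");

print_r($parts);
// $parts is ["One.", "Two!"]
```

Any delimiter works as long as it does not appear unescaped anywhere in the pattern, including inside /x comments.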

+0

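Thanks. This almost works, with one quirk: the preg_split regex joins two sentences together. See http://ideone.com/AUImET. Any ideas? Thanks also for the explanation of \h, I was not familiar with it. – johnh10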

+0

What if you add 'PREG_SPLIT_DELIM_CAPTURE', use a capturing group '(\h+|^$)', and zero out the element at index 2? See [this demo](http://ideone.com/ddq1hV). –
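The suggestion above might look like the following sketch. It makes two assumptions that are not in the thread: the pattern is shortened (the abbreviation lookbehind is omitted), and instead of blanking index 2 it removes both the sentence and the delimiter that follows it, which avoids leaving a double space. Note that "Sentence two!" and the first sentence of paragraph two still end up in one element, because the line breaks between them are never matched, but the blank line is preserved.

```php
<?php
// Capturing group so that PREG_SPLIT_DELIM_CAPTURE returns the separators too.
$re = '/(?<=[.!?])(\h+|^$)/m';

$text = "This is paragraph one. This is sentence one. Sentence two!\n\n"
      . "This is paragraph two. This is sentence three. Sentence four!";

// Result alternates: sentence, delimiter, sentence, delimiter, ...
$parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Index 2 is "This is sentence one."; index 3 is the space after it.
array_splice($parts, 2, 2);

$result = implode('', $parts);
echo $result, "\n";
```

With the sample text above, the output keeps the empty line between the two paragraphs and only the second sentence is gone.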