2010-12-03 55 views
1

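RegexKitLite: match expression -> match anything except ] -> match ]

I'm essentially trying to replace all of the footnotes in a large text. I'm doing this in Objective-C for a variety of reasons, so please assume that constraint.

Every footnote begins with this: [Footnote

Every footnote ends with this: ]

There can be absolutely anything between these two markers, including line breaks. However, there will never be a ] between them.

So, essentially, I want to match [Footnote, then match anything except ], until a ] is matched.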

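This is the closest I've been able to get to identifying all of the footnotes:

NSString *regexString = @"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";

This regex manages to find 780 of the 889 footnotes. None of those 780 matches appears to be a false alarm. The only footnotes it seems to miss are the ones that contain line breaks.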

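I've spent a long time on www.regular-expressions.info, especially on the page about the dot (http://www.regular-expressions.info/dot.html). That helped me create the regex above, but I still haven't figured out how to match any character, including a line break, except a closing bracket.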

Using the following regex instead manages to capture all of the footnotes, but it grabs far too much text because the * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
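To make the greedy behaviour concrete, here is a minimal sketch that compares the greedy `.*` with its reluctant form `.*?`. It assumes Foundation's NSRegularExpression in place of RegexKitLite (both are built on ICU regex syntax), and `FootnoteMatches`, `kGreedyPattern`, and `kReluctantPattern` are hypothetical names made up for this illustration.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): returns every substring
// of `text` matched by `pattern`, in order of appearance.
static NSArray *FootnoteMatches(NSString *pattern, NSString *text) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return [NSArray array];  // the pattern failed to compile
    }
    NSMutableArray *found = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, [text length]);
    for (NSTextCheckingResult *match in
         [regex matchesInString:text options:0 range:whole]) {
        [found addObject:[text substringWithRange:[match range]]];
    }
    return found;
}

// Greedy: with (?s), .* runs to the LAST ] in the text, so several
// footnotes and the prose between them collapse into one match.
static NSString *const kGreedyPattern = @"(?s)\\[Footnote.*\\]";

// Reluctant: .*? stops at the FIRST ] that follows each [Footnote.
static NSString *const kReluctantPattern = @"(?s)\\[Footnote.*?\\]";
```

On a text with two footnotes, the greedy pattern should return a single oversized match while the reluctant one returns each footnote separately.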

Here is some sample text to run the regex on:

<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p> 

    <p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p> 

    <p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p> 

    <p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p> 

When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.-- 

[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".] 

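This example contains two footnotes and some non-footnote text. As you can see, the first footnote contains two line breaks. The second contains none.

The first regex mentioned above captures Footnote 2 in this sample text, but it does not capture Footnote 1 because that footnote contains line breaks.

Any improvements to my regex would be greatly appreciated.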

Answer

3

Try

@"\\[Footnote[^\\]]*\\]"; 

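This should match across line breaks, because a negated character class such as [^\]] also matches newline characters. There is no need to put each single character into its own character class.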
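As a rough sketch of how the pattern above could be used to strip footnotes, including ones that span several lines: the code below assumes Foundation's NSRegularExpression rather than RegexKitLite (the pattern string is the same, since both use ICU syntax), and `StripFootnotes` is a hypothetical helper name.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): removes every
// [Footnote ...] span from `text`, whether or not it contains line breaks.
static NSString *StripFootnotes(NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\[Footnote[^\\]]*\\]"
                                                  options:0
                                                    error:NULL];
    if (regex == nil) {
        return text;  // pattern failed to compile; leave the text untouched
    }
    // Replace each match with the empty string.
    return [regex stringByReplacingMatchesInString:text
                                           options:0
                                             range:NSMakeRange(0, [text length])
                                      withTemplate:@""];
}
```

With RegexKitLite, the same pattern string would presumably be handed to that library's own replacement method instead.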

As a commented, multi-line regex (without the string escaping):

\[  # match a literal [ 
Footnote # match literal "Footnote" 
[^\]]* # match zero or more characters except ] 
\]  # match ] 

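Inside a character class ([...]), the caret ^ takes on a different meaning: it negates the contents of the class. So [ab] matches a or b, while [^ab] matches any single character except a or b.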
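A small sketch of the two meanings of the caret, assuming Foundation's NSRegularExpression (ICU syntax, as in RegexKitLite); `CountMatches` is a hypothetical helper introduced only for this illustration.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): counts how many times
// `pattern` matches in `text`. Returns 0 if the pattern does not compile,
// because messaging a nil regex yields 0.
static NSUInteger CountMatches(NSString *pattern, NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:NULL];
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, [text length])];
}

// Patterns to compare on the text "abc\ncab":
//   [ab]    one character that is a or b
//   [^ab]   one character that is neither a nor b (a newline qualifies)
//   ^a      an a at the very start of the input (caret as anchor)
//   (?m)^c  a c at the start of any line (caret as anchor, multi-line mode)
```

The caret only acts as a start-of-line anchor outside a character class; as the first character inside the brackets it negates the class.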

Of course, this will fail if you have nested footnotes. A text like [Footnote foo [footnote bar] foo] will be matched from the start only up to bar]. To avoid this, change the regex to

@"\\[Footnote[^\\]\\[]*\\]"; 

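so that neither opening nor closing brackets are allowed inside the match. Then, of course, you only match the innermost footnotes and have to apply the same regex to the entire text twice (or more, depending on the maximum nesting level), "peeling" away one layer at a time.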
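The layer-by-layer "peeling" could be sketched as a loop that repeats the replacement until the text stops changing. This is only an illustration under assumptions: it uses Foundation's NSRegularExpression instead of RegexKitLite, and `StripNestedFootnotes` is a hypothetical helper name.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): removes footnotes that
// may be nested inside each other by stripping the innermost ones first.
static NSString *StripNestedFootnotes(NSString *text) {
    // Neither [ nor ] may appear inside a match, so only innermost
    // footnotes match on any given pass.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\[Footnote[^\\]\\[]*\\]"
                                                  options:0
                                                    error:NULL];
    if (regex == nil) {
        return text;  // pattern failed to compile; leave the text untouched
    }
    NSString *current = text;
    while (YES) {
        NSString *next =
            [regex stringByReplacingMatchesInString:current
                                            options:0
                                              range:NSMakeRange(0, [current length])
                                       withTemplate:@""];
        if ([next isEqualToString:current]) {
            return current;  // nothing left to peel
        }
        current = next;  // one nesting level removed; try again
    }
}
```

Note that a footnote containing some other bracketed text, such as [sic], would never be removed by this loop, because the inner brackets are not themselves a footnote and so are never peeled away.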

+0

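This seems to work. It matches 883 times, yet it replaced all of the footnotes (889), so evidently there are 6 cases where it swallowed two footnotes instead of one. Perhaps there are some nested footnotes? It will take me a while to find them. Why does this work? I don't understand how [^\\]]* works. Shouldn't it just look for lines that begin with a closing bracket? I thought the ^ character was supposed to "match at the start of a line". – 2010-12-03 21:14:19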