UTF-8-correct regex for CamelCase (WikiWord) in Perl

There is an existing question here about a CamelCase regex. Taking tchrist's post into account, I am wondering what the correct UTF-8 CamelCase regex would be.

Starting with (brian d foy's) regex:

/ 
    \b   # start at word boundary 
    [A-Z]  # start with upper 
    [a-zA-Z]* # followed by any alpha 

    (?: # non-capturing grouping for alternation precedence 
     [a-z][a-zA-Z]*[A-Z] # next bit is lower, any zero or more, ending with upper 
      |      # or 
     [A-Z][a-zA-Z]*[a-z] # next bit is upper, any zero or more, ending with lower 
    ) 

    [a-zA-Z]* # anything that's left 
    \b   # end at word 
/x

and modifying it to:

/ 
    \b   # start at word boundary 
    \p{Uppercase_Letter}  # start with upper 
    \p{Alphabetic}*   # followed by any alpha 

    (?: # non-capturing grouping for alternation precedence 
     \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter} ### next bit is lower, any zero or more, ending with upper 
      |     # or 
     \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter} ### next bit is upper, any zero or more, ending with lower 
    ) 

    \p{Alphabetic}*   # anything that's left 
    \b   # end at word 
/x

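The lines marked with '###' are the ones I have a problem with.

Also, how should the regex be modified if digits and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?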

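Update (in response to ysth's comment):

For what follows,

any: means "an uppercase or lowercase letter, or a digit, or an underscore"

The regex should match CamelWord and CaW, that is:

begin with an uppercase letter
optional any
a lowercase letter or a digit or an underscore
optional any
an uppercase letter
optional any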
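
The list above can be sketched as a pattern almost term by term. This is only one literal reading of the definition, not an accepted answer: the names `$any`, `$lowish`, `$camel` and `is_camel` are made up for illustration, and using `\p{Upper}`, `\p{Lower}` and `\p{Nd}` for "uppercase letter", "lowercase letter" and "digit" is an assumption.

```perl
use strict;
use warnings;

# "any": uppercase or lowercase letter, digit, or underscore (assumed mapping).
my $any    = qr/[\p{Upper}\p{Lower}\p{Nd}_]/;
# Lowercase letter, digit, or underscore.
my $lowish = qr/[\p{Lower}\p{Nd}_]/;

# One pattern element per line of the definition.
my $camel = qr/
    \p{Upper}   # begin with an uppercase letter
    $any*       # optional any
    $lowish     # a lowercase letter, digit, or underscore
    $any*       # optional any
    \p{Upper}   # an uppercase letter
    $any*       # optional any
/x;

# Hypothetical helper: true when the whole string is a CamelCase word.
sub is_camel {
    my ($word) = @_;
    return $word =~ /\A $camel \z/x ? 1 : 0;
}

# \x{C9} is LATIN CAPITAL LETTER E WITH ACUTE.
for my $word ('CamelWord', 'CaW', 'W2X3', 'Camel', 'camelWord', "\x{C9}cole\x{C9}t\x{E9}") {
    printf "%-12s %s\n", ($word =~ /\A[\x00-\x7F]+\z/ ? $word : '(non-ASCII)'),
        is_camel($word) ? 'match' : 'no match';
}
```

Under this reading CamelWord, CaW and W2X3 are accepted, while Camel (no second uppercase) and camelWord (lowercase start) are not.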

Please don't mark this as a duplicate, because it isn't one. The original question (and its answers) only considered ASCII.

Source

2011-06-12 jm666

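As an aside, you've started with a really odd regex; I think it's no different from the simpler `/\b[A-Z]+[a-z][A-Za-z]*\b/` (a "word" made up only of letters, starting with an uppercase letter and including at least one lowercase letter). (Update: I was wrong, the original regex requires at least three letters.) – ysth 2011-06-12 16:25:14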

In any case, please don't start from an ASCII regex; start by defining as precisely as possible what you want to match. – ysth 2011-06-12 16:29:01

Updated the question with a (hopefully sufficiently) precise definition. – jm666 2011-06-12 17:02:57

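I really can't tell what you're trying to do, but this should be closer to your original intent. I still can't work out exactly what you mean, though.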

m{ 
    \b 
    \p{Upper}  # start with uppercase code point (NOT LETTER) 

    \w*   # optional ident chars 

    # note that upper and lower are not related to letters 
    (?: \p{Lower} \w* \p{Upper} 
     | \p{Upper} \w* \p{Lower} 
    ) 

    \w* 

    \b 
}x
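
To get a feel for what the pattern above accepts, here is a small sketch that runs it over a sample sentence. The sample text and the variable names (`$wiki`, `$text`, `@hits`) are made up for illustration; `use v5.14` is only there to turn on `unicode_strings`, so that `\b` and `\w` treat the É (written as `\x{C9}`) as a word character.

```perl
use v5.14;      # enables strict and the unicode_strings feature
use strict;
use warnings;

# The pattern from the answer, stored as a compiled regex.
my $wiki = qr{
    \b
    \p{Upper}
    \w*
    (?: \p{Lower} \w* \p{Upper}
      | \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# \x{C9} is LATIN CAPITAL LETTER E WITH ACUTE.
my $text = "See CamelCase, \x{C9}coleNormale and plain Word here.";

# Collect every match in the sample text.
my @hits = $text =~ /($wiki)/g;

binmode(STDOUT, ':encoding(UTF-8)');
print join(', ', @hits), "\n";   # expected: CamelCase and ÉcoleNormale
```

Note that "See" and "Word" are skipped because they contain only one uppercase code point, and that W2X3 would not match this pattern either, since a digit is not `\p{Lower}`.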

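Never use [a-z]. In fact, don't use \p{Lowercase_Letter} or \p{Ll} either, because those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.

And remember that \w is really just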
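
A few concrete code points make the difference visible. The sketch below is illustrative only: `classify` and `@samples` are made-up names, and the sample code points were picked because they carry the Other_Lowercase or Other_Uppercase property while their general category is not Ll or Lu.

```perl
use strict;
use warnings;

# Code points that are Lowercase/Uppercase by property but not GC=Ll/GC=Lu,
# plus plain ASCII 'a' as a baseline.
my @samples = (
    [0x0061, 'LATIN SMALL LETTER A (GC=Ll)'],
    [0x02B0, 'MODIFIER LETTER SMALL H (GC=Lm)'],
    [0x2170, 'SMALL ROMAN NUMERAL ONE (GC=Nl)'],
    [0x24D0, 'CIRCLED LATIN SMALL LETTER A (GC=So)'],
    [0x2160, 'ROMAN NUMERAL ONE (GC=Nl)'],
);

# Hypothetical helper: report which of the four properties a code point has.
sub classify {
    my ($cp) = @_;
    my $ch = chr $cp;
    return {
        lower => ($ch =~ /\p{Lower}/ ? 1 : 0),
        ll    => ($ch =~ /\p{Ll}/    ? 1 : 0),
        upper => ($ch =~ /\p{Upper}/ ? 1 : 0),
        lu    => ($ch =~ /\p{Lu}/    ? 1 : 0),
    };
}

for my $s (@samples) {
    my $r = classify($s->[0]);
    printf "U+%04X %-38s Lower=%d Ll=%d Upper=%d Lu=%d\n",
        $s->[0], $s->[1], @{$r}{qw(lower ll upper lu)};
}
```

A pattern written with `\p{Ll}` would treat the small Roman numeral ⅰ (U+2170) as "not lowercase", while `\p{Lower}` accepts it.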

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
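
As a rough check of that character class, the sketch below tests one sample character from several of those categories against `\w`. The helper name `matches_w` and the choice of sample characters are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the single character is matched by \w.
sub matches_w {
    my ($ch) = @_;
    return $ch =~ /\A\w\z/ ? 1 : 0;
}

# One representative per category, plus a hyphen as a counterexample.
my @samples = (
    ['underscore (Connector_Punctuation)',        '_'],
    ['ARABIC-INDIC DIGIT THREE (Decimal_Number)', "\x{0663}"],
    ['COMBINING ACUTE ACCENT (Mark)',             "\x{0301}"],
    ['ROMAN NUMERAL ONE (Letter_Number)',         "\x{2160}"],
    ['HYPHEN-MINUS (Dash_Punctuation)',           '-'],
);

for my $s (@samples) {
    printf "%-45s %s\n", $s->[0], matches_w($s->[1]) ? '\w' : 'not \w';
}
```

This is why `\w*` in the pattern above already covers digits and underscores as "optional any" characters.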

Source

2011-06-12 18:19:13 tchrist

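Why are `Lowercase` and `Lower` preferable? (i.e. what do they include that `Ll` doesn't?) And what is the difference (if any) between `Lowercase` and `Lower`? – ikegami 2011-06-12 21:32:11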

@ikegami: `Lowercase` and `Lower` are the same, and are the union of `GC=Lowercase_Letter` and `Other_Lowercase=True`. There are 201 code points that are either `Lower` *but not* `GC=Ll`, or else are `Upper` *but not* `GC=Lu`. These include `GC=Mn`, `GC=Lm`, `GC=Nl`, and `GC=So` code points. ***Sorry, I really thought this was all common knowledge by now!*** Run `unichars -gs '/(?=\P{Ll})\p{Lower}/x || /(?=\P{Lu})\p{Upper}/x' | ucsort --upper-before-lower | cat -n | less` to see what I mean. Those programs are in my [unicode toolchest](http://training.perl.com/scripts/). – tchrist 2011-06-12 23:36:07

@tchrist - the link to the unicode toolchest is dead (at least for now). Is there an alternative? – jm666 2014-05-15 15:09:36
