Folding/Normalizing Ligatures (e.g. Æ to ae) Using (Core)Foundation

I am writing a helper that performs a number of transformations on an input string in order to create a search-friendly representation of that string.

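Think of the following scenario:

- Full-text search on German or French texts
- The entries in your datastore contain
  1. Müller
  2. Großmann
  3. Çingletòn
  4. Bjørk
  5. Æreogramme
- The search should be fuzzy, in that
  1. ull, Üll etc. match Müller
  2. Gros, groß etc. match Großmann
  3. cin etc. match Çingletòn
  4. bjö, bjo etc. match Bjørk
  5. aereo etc. match Æreogramme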

So far, I've been successful with cases (1), (3) and (4).

What I cannot figure out is how to handle (2) and (5).

So far, I've tried the following methods to no avail:

CFStringNormalize() // with all documented normalization forms 
CFStringTransform() // using the kCFStringTransformToLatin, kCFStringTransformStripCombiningMarks, kCFStringTransformStripDiacritics 
CFStringFold() // using kCFCompareNonliteral, kCFCompareWidthInsensitive, kCFCompareLocalized in a number of combinations -- aside: how on earth do I normalize simply _composing_ already decomposed strings??? as soon as I pack that in, my formerly passing tests fail, as well...

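I've skimmed the ICU User Guide for Transforms, but haven't invested too heavily in it... for what I think are obvious reasons.

I know that I could catch case (2) by transforming to uppercase and then back to lowercase, which would work within the realm of this particular application. I am, however, interested in solving this problem at a more fundamental level, hopefully allowing for case-sensitive applications as well.

Any hints would be greatly appreciated!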

Source

2012-02-21 danyowdee

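Congratulations, you've found one of the more painful bits of text processing!

First off, NamesList.txt and CaseFolding.txt are indispensable resources for things like this, if you haven't already seen them.

Part of the problem is that you're trying to do something almost correct that works for all the languages/locales you care about, whereas Unicode is more concerned with doing the correct thing when displaying strings in a single language and locale.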

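For (2), ß has canonically case-folded to ss since the earliest CaseFolding.txt I can find (3.0-Update1/CaseFolding-2.txt). CFStringFold() and -[NSString stringByFoldingWithOptions:] ought to do the right thing, but if they don't, a "locale-independent" s.upper().lower() appears to give a sensible answer for all inputs (and also handles the infamous "Turkish I").

For (5), you're a little out of luck: Unicode 6.2 doesn't appear to contain a normative mapping from Æ to AE, and the character has changed from "letter" to "ligature" and back again (U+00C6 is LATIN CAPITAL LETTER A E in 1.0, LATIN CAPITAL LIGATURE AE in 1.1, and LATIN CAPITAL LETTER AE in 2.0). You could search NamesList.txt for "ligature" and add a bunch of special cases.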
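As a rough illustration of the two points above, here is a minimal Python sketch (Python only because the `s.upper().lower()` idiom is Python; this is not the Core Foundation API). `str.casefold()` applies full Unicode case folding, which maps ß to ss. The `LIGATURE_SPECIAL_CASES` table and the helpers `expand_ligatures` and `fold_for_search` are hypothetical names introduced for illustration, and the table is deliberately tiny; a real one would be built by hand from NamesList.txt.

```python
# Hypothetical special-case table: Unicode defines no decomposition for these
# characters, so they have to be mapped by hand. Extend as needed.
LIGATURE_SPECIAL_CASES = {
    "Æ": "AE",
    "æ": "ae",
    "Œ": "OE",
    "œ": "oe",
}


def expand_ligatures(s: str) -> str:
    """Replace each character listed in the table with its expansion."""
    return "".join(LIGATURE_SPECIAL_CASES.get(ch, ch) for ch in s)


def fold_for_search(s: str) -> str:
    """Expand the special cases, then apply full Unicode case folding."""
    return expand_ligatures(s).casefold()


print(fold_for_search("Großmann"))    # case (2)
print(fold_for_search("Æreogramme"))  # case (5)
print("ß".upper().lower())            # the upper-then-lower fallback
```

This prints `grossmann`, `aereogramme` and `ss`. Note that `expand_ligatures` on its own preserves case, so it could also serve a case-sensitive application.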

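Notes:

- CFStringNormalize() doesn't do what you want. You do want to normalize strings before adding them to the index; I suggest NFKC at the start and end of the other processing.
- CFStringTransform() doesn't quite do what you want either; all the scripts are "latin".
- CFStringFold() is order-dependent: the combining ypogegrammeni and prosgegrammeni are stripped by kCFCompareDiacriticInsensitive but converted to a lowercase iota by kCFCompareCaseInsensitive. The "correct" thing appears to be to do the case-fold first, followed by the other folds, although stripping it may make more sense linguistically.
- You almost certainly do not want to use kCFCompareLocalized unless you want to rebuild the search index every time the locale changes.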
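The ordering advice in these notes can be sketched with Python's standard `unicodedata` module. This is an approximation, not what CFStringFold() does internally: `strip_diacritics`, `search_key` and the `case_fold_first` parameter are hypothetical names, and dropping only category-Mn characters is an assumed simplification of "diacritic-insensitive".

```python
import unicodedata


def strip_diacritics(s: str) -> str:
    """Decompose, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def search_key(s: str, case_fold_first: bool = True) -> str:
    """Build a search key: NFKC, the two folds in the chosen order, NFKC."""
    s = unicodedata.normalize("NFKC", s)
    if case_fold_first:
        s = strip_diacritics(s.casefold())
    else:
        s = strip_diacritics(s).casefold()
    return unicodedata.normalize("NFKC", s)


print(search_key("Müller"))         # precomposed ü
print(search_key("Mu\u0308ller"))   # u followed by a combining diaeresis
print(search_key("Çingletòn"))
# U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI: the order matters here.
print(search_key("\u1fb3", case_fold_first=True))
print(search_key("\u1fb3", case_fold_first=False))
```

Both spellings of Müller come out as `muller`, and Çingletòn becomes `cingleton`. For U+1FB3, folding case first turns the ypogegrammeni into a lowercase iota (`αι`), while stripping diacritics first drops it (`α`), which is the order dependence described above.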

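Note to readers coming from other languages: check that the function you use does not depend on the user's current locale! Java users should use something like s.toUpperCase(Locale.ENGLISH); .NET users should use s.ToUpperInvariant(). If you actually want the user's current locale, specify it explicitly.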

Source

2013-03-18 20:10:44

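+1 **Awesome!** I had already concluded that I would never get an answer to this question. I'm no longer working on this problem, so I'll need some time to make full sense of it - I guess I have some reading to do over the weekend! – danyowdee 2013-03-19 13:03:03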

I've used the following extension on String, and it seems to work well.

import Foundation

extension String {
    /// Normalized version of the string for comparisons and database lookups.
    /// If normalization fails or results in an empty string, the original string is returned.
    var normalized: String? {
        // Expand ligatures and other joined characters and flatten to simple ASCII (æ => ae, etc.)
        // by converting to ASCII data and back.
        guard let data = self.data(using: .ascii, allowLossyConversion: true) else {
            print("WARNING: Unable to convert string to ASCII data: \(self)")
            return self
        }
        guard let processed = String(data: data, encoding: .ascii) else {
            print("WARNING: Unable to decode ASCII data while normalizing string: \(self)")
            return self
        }
        var normalized = processed

        // Remove "?" placeholders; curly quotes and the like are destroyed by the conversion above.
        normalized = normalized.replacingOccurrences(of: "?", with: "")
        // Strip apostrophes.
        normalized = normalized.replacingOccurrences(of: "'", with: "")
        // Replace non-alphanumeric characters with spaces.
        normalized = normalized
            .components(separatedBy: CharacterSet.alphanumerics.inverted)
            .joined(separator: " ")
        // Lowercase the string.
        normalized = normalized.lowercased()

        // Collapse runs of spaces, line breaks and tabs, and trim.
        normalized = normalized
            .components(separatedBy: .whitespacesAndNewlines)
            .filter { !$0.isEmpty }
            .joined(separator: " ")

        // May be empty if there were no alphanumeric characters; in that case,
        // use the raw string as the "normalized" form.
        return normalized.isEmpty ? self : normalized
    }
}

Source

2016-09-18 02:09:30 Gujamin
