2009-10-21 70 views
5

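Convert numbered Pinyin to accented Pinyin?

Given a source text like

nin2 hao3 ma

(which is the typical way of writing Pinyin in ASCII, without proper accented characters) and given a (UTF-8) conversion table like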

a1;ā 
e1;ē 
i1;ī 
o1;ō 
u1;ū 
ü1;ǖ 
A1;Ā 
E1;Ē 
... 

how would I convert the source text into

nín hǎo ma 

For what it's worth, I'm using PHP, and I suspect a regular expression is what I should be looking into?

+0

Additional info for those looking into this: (from the wikipedia article http://en.wikipedia.org/wiki/Pinyin) An algorithm to find the correct vowel letter (when there is more than one) is as follows: 1. If there is an "a" or an "e", it will take the tone mark. 2. If there is an "ou", then the "o" takes the tone mark. 3. Otherwise, the second vowel takes the tone mark. – 2009-10-21 15:21:35
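
The three rules quoted in the comment above can be sketched in a few lines of JavaScript. This is only an illustration, not code from any answer in this thread: the name `toneVowelIndex` is hypothetical, and the input is assumed to be a single lowercase syllable with the tone digit already removed.

```javascript
// Hypothetical helper: index of the vowel that takes the tone mark,
// or -1 if the syllable has no vowel. Assumes one lowercase syllable
// with the tone digit already stripped.
function toneVowelIndex(syllable) {
  // Rule 1: an "a" or an "e" always takes the mark.
  var a = syllable.indexOf('a');
  if (a >= 0) return a;
  var e = syllable.indexOf('e');
  if (e >= 0) return e;
  // Rule 2: in "ou", the "o" takes the mark.
  var ou = syllable.indexOf('ou');
  if (ou >= 0) return ou;
  // Rule 3: otherwise the last vowel takes the mark (the second one
  // when there are two, the only one when there is just one).
  var vowels = ['i', 'o', 'u', 'ü'];
  for (var k = syllable.length - 1; k >= 0; k -= 1) {
    if (vowels.indexOf(syllable.charAt(k)) >= 0) return k;
  }
  return -1;
}

console.log(toneVowelIndex('hao')); // 1 (the "a")
console.log(toneVowelIndex('gou')); // 1 (the "o")
console.log(toneVowelIndex('liu')); // 2 (the "u")
```

Once the index is known, the plain vowel at that position is swapped for its accented form and the digit is dropped.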

Answers

1
<?php 
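$in = 'nin2 hao3 ma';
$out = 'nín hǎo ma';

function replacer($match) {
    // Partial table: only the entries needed for this example are filled in.
    static $trTable = array(
        1 => array(
            'a' => 'ā',
            'e' => 'ē',
            'i' => 'ī',
            'o' => 'ō',
            'u' => 'ū',
            'ü' => 'ǖ',
            'A' => 'Ā',
            'E' => 'Ē'),
        2 => array('i' => 'í'),
        3 => array('a' => 'ǎ')
    );
    list(, $word, $i) = $match;
    // Leave the match untouched when the tone has no table entry yet.
    if (!isset($trTable[$i])) {
        return $match[0];
    }
    // Note: this replaces every listed vowel in the word, not just one.
    return str_replace(
        array_keys($trTable[$i]),
        array_values($trTable[$i]),
        $word);
}

// Outputs: bool(true)
var_dump(preg_replace_callback('~(\w+)(\d+)~', 'replacer', $in) === $out);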
+0

Impressive. Thanks! – 2009-10-23 02:44:11

+1

does this really take into account all the cases? what about 4th tone? what about 'a' in 2nd tone? isn't there something special with 'ou'? or were you just trying to provide a piece of code that needs to be expanded? – philfreo 2010-11-10 02:18:32

+0

So what about diphthongs? E.g. liao3 --> liǎo. How does this function figure those out? – 2015-06-28 12:11:38

9

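Ollie's algorithm was a nice start, but it didn't apply the marks correctly. For example, qiao1 became qīāō. This one is correct and complete. You can easily see how the replacement rules are defined.

It also does the whole thing for tone 5, although that doesn't affect the output apart from removing the number. I left it in, in case you want to do something with tone 5.

The algorithm works as follows:

  • The word and the tone are provided in $match[1] and $match[2]
  • A star is added behind the letter that should carry the accent mark
  • The letter with the star is replaced by that letter with the correct tone mark.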

Example:

qiao3 => (iao becomes ia*o) => qia*o => qiǎo

This strategy, together with the use of strtr (which prioritizes longer replacements), makes sure that this won't happen:

qiao1 => qīāō


function pinyin_addaccents($string) {
    # Find words with a number behind them, and replace with callback fn.
    return preg_replace_callback(
        '~([a-zA-ZüÜ]+)(\d)~',
        'pinyin_addaccents_cb',
        $string);
}

# Helper callback
function pinyin_addaccents_cb($match) {
    static $accentmap = null;

    if ($accentmap === null) {
        # Where to place the accent marks
        $stars =
            'a* e* i* o* u* ü* '.
            'A* E* I* O* U* Ü* '.
            'a*i a*o e*i ia* ia*o ie* io* iu* '.
            'A*I A*O E*I IA* IA*O IE* IO* IU* '.
            'o*u ua* ua*i ue* ui* uo* üe* '.
            'O*U UA* UA*I UE* UI* UO* ÜE*';
        $nostars = str_replace('*', '', $stars);

        # Build an array like Array('a' => 'a*') and store statically
        $accentmap = array_combine(explode(' ', $nostars), explode(' ', $stars));
        unset($stars, $nostars);
    }

    static $vowels =
        Array('a*','e*','i*','o*','u*','ü*','A*','E*','I*','O*','U*','Ü*');

    static $pinyin = Array(
        1 => Array('ā','ē','ī','ō','ū','ǖ','Ā','Ē','Ī','Ō','Ū','Ǖ'),
        2 => Array('á','é','í','ó','ú','ǘ','Á','É','Í','Ó','Ú','Ǘ'),
        3 => Array('ǎ','ě','ǐ','ǒ','ǔ','ǚ','Ǎ','Ě','Ǐ','Ǒ','Ǔ','Ǚ'),
        4 => Array('à','è','ì','ò','ù','ǜ','À','È','Ì','Ò','Ù','Ǜ'),
        5 => Array('a','e','i','o','u','ü','A','E','I','O','U','Ü')
    );

    list(, $word, $tone) = $match;
    # Digits other than 1-5 are not tones: leave the match untouched
    if (!isset($pinyin[$tone])) {
        return $match[0];
    }
    # Add star to vowelcluster
    $word = strtr($word, $accentmap);
    # Replace starred letter with accented
    $word = str_replace($vowels, $pinyin[$tone], $word);
    return $word;
}
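
For readers outside PHP, the star-marking strategy can be sketched in JavaScript as well. JavaScript has no strtr, so this sketch sorts the clusters by length and tries the longer ones first, which is the behaviour the answer relies on. The names `addAccents`, `starMap` and `clusterRe` are made up for this illustration, and only lowercase input is handled.

```javascript
// Vowel clusters with a star after the letter that takes the tone mark
// (the lowercase half of the table in the PHP answer).
var STARS = ('a* e* i* o* u* ü* a*i a*o e*i ia* ia*o ie* io* iu* ' +
             'o*u ua* ua*i ue* ui* uo* üe*').split(' ');

var VOWELS = ['a', 'e', 'i', 'o', 'u', 'ü'];
var TONES = {
  1: ['ā', 'ē', 'ī', 'ō', 'ū', 'ǖ'],
  2: ['á', 'é', 'í', 'ó', 'ú', 'ǘ'],
  3: ['ǎ', 'ě', 'ǐ', 'ǒ', 'ǔ', 'ǚ'],
  4: ['à', 'è', 'ì', 'ò', 'ù', 'ǜ'],
  5: ['a', 'e', 'i', 'o', 'u', 'ü']
};

// Map 'iao' -> 'ia*o' and so on, then build a regex that tries the
// longest clusters first.
var starMap = {};
STARS.forEach(function (s) { starMap[s.replace('*', '')] = s; });
var clusters = Object.keys(starMap).sort(function (x, y) {
  return y.length - x.length;
});
var clusterRe = new RegExp(clusters.join('|'), 'g');

function addAccents(text) {
  // A run of letters followed by a tone digit 1-5 is one syllable.
  return text.replace(/([a-zü]+)([1-5])/g, function (_, word, tone) {
    // Step 1: put a star behind the vowel that carries the mark.
    var starred = word.replace(clusterRe, function (c) { return starMap[c]; });
    // Step 2: swap the starred vowel for its accented form.
    return starred.replace(/([aeiouü])\*/g, function (_, v) {
      return TONES[tone][VOWELS.indexOf(v)];
    });
  });
}

console.log(addAccents('qiao3 nin2 hao3 ma')); // qiǎo nín hǎo ma
```

Because 'iao' is tried before 'ia', 'ao', 'i', 'a' and 'o', the syllable gets exactly one star, and therefore exactly one accent.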
+0

that works great! thanks so much! I have the following issue however: The 'Ǖ' are written like this in my data: 'lu:4' How can I integrate that here? – uncovery 2011-09-21 07:36:53

+0

I helped myself now by adding '$string = str_replace(array('u:', 'U:'), array('ü','Ü'),$string);' before 'return preg_replace_callback( ' – uncovery 2011-09-21 09:50:02
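
The same pre-processing step, sketched in JavaScript for anyone combining it with a JavaScript converter. The function name `normalizeUmlaut` is hypothetical; it only rewrites the "u:" spelling mentioned in the comment above and leaves everything else alone.

```javascript
// Hypothetical pre-processing step: rewrite the ASCII spelling "u:"
// (as in "lu:4") to "ü" so that a later accent pass can recognise it.
function normalizeUmlaut(text) {
  return text.replace(/u:/g, 'ü').replace(/U:/g, 'Ü');
}

console.log(normalizeUmlaut('lu:4 NU:3')); // lü4 NÜ3
```

Run this before the tone-placement step, since that step expects a letter directly in front of the tone digit.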

+0

This is great; thanks! – Jen 2013-06-07 08:15:55

1

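For a .NET solution, try Pinyin4j.NET

Features: converts Chinese (both Simplified and Traditional) to the most popular Pinyin systems. The supported Pinyin systems are listed below.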

  • Hanyu Pinyin 漢語拼音
  • Tongyong Pinyin 通用拼音
  • Wade-Giles 威妥瑪拼音
  • MPS2 (Mandarin Phonetic Symbols II) 國語注音符號第二式
  • Yale Romanization 耶魯羅馬化拼音
  • Gwoyeu Romatzyh 國語國語羅馬化拼音
0

VB macro for (Libre)Office: convert Pinyin tone numbers to accents

Hopefully the algorithm is correct according to the Pinyin rules, especially for i and u.

sub replaceNumberByTones 

    call PinyinTonesNumber("a([a-z]*[a-z]*)0", "a$1") 
    call PinyinTonesNumber("a([a-z]*[a-z]*)1", "ā$1") 
    call PinyinTonesNumber("a([a-z]*[a-z]*)2", "á$1") 
    call PinyinTonesNumber("a([a-z]*[a-z]*)3", "ǎ$1") 
    call PinyinTonesNumber("a([a-z]*[a-z]*)4", "à$1") 

    call PinyinTonesNumber("o([a-z]*[a-z]*)0", "o$1") 
    call PinyinTonesNumber("o([a-z]*[a-z]*)1", "ō$1") 
    call PinyinTonesNumber("o([a-z]*[a-z]*)2", "ó$1") 
    call PinyinTonesNumber("o([a-z]*[a-z]*)3", "ǒ$1") 
    call PinyinTonesNumber("o([a-z]*[a-z]*)4", "ò$1") 

    call PinyinTonesNumber("e([a-z]*[a-z]*)0", "e$1") 
    call PinyinTonesNumber("e([a-z]*[a-z]*)1", "ē$1") 
    call PinyinTonesNumber("e([a-z]*[a-z]*)2", "é$1") 
    call PinyinTonesNumber("e([a-z]*[a-z]*)3", "ě$1") 
    call PinyinTonesNumber("e([a-z]*[a-z]*)4", "è$1") 

    rem u is skipped when an i follows, so that "ui" puts the mark on the i 
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)0", "u$1") 
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)1", "ū$1") 
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)2", "ú$1") 
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)3", "ǔ$1") 
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)4", "ù$1") 

    call PinyinTonesNumber("i([a-z]*[a-z]*)0", "i$1") 
    call PinyinTonesNumber("i([a-z]*[a-z]*)1", "ī$1") 
    call PinyinTonesNumber("i([a-z]*[a-z]*)2", "í$1") 
    call PinyinTonesNumber("i([a-z]*[a-z]*)3", "ǐ$1") 
    call PinyinTonesNumber("i([a-z]*[a-z]*)4", "ì$1") 

    End sub 

    sub PinyinTonesNumber(expression, replacement) 
    rem ---------------------------------------------------------------------- 
    rem define variables 
    dim document as object 
    dim dispatcher as object 
    rem ---------------------------------------------------------------------- 
    rem get access to the document 
    document = ThisComponent.CurrentController.Frame 
    dispatcher = createUnoService("com.sun.star.frame.DispatchHelper") 

    rem ---------------------------------------------------------------------- 
    dim args1(18) as new com.sun.star.beans.PropertyValue 
    args1(0).Name = "SearchItem.StyleFamily" 
    args1(0).Value = 2 
    args1(1).Name = "SearchItem.CellType" 
    args1(1).Value = 0 
    args1(2).Name = "SearchItem.RowDirection" 
    args1(2).Value = true 
    args1(3).Name = "SearchItem.AllTables" 
    args1(3).Value = false 
    args1(4).Name = "SearchItem.Backward" 
    args1(4).Value = false 
    args1(5).Name = "SearchItem.Pattern" 
    args1(5).Value = false 
    args1(6).Name = "SearchItem.Content" 
    args1(6).Value = false 
    args1(7).Name = "SearchItem.AsianOptions" 
    args1(7).Value = false 
    args1(8).Name = "SearchItem.AlgorithmType" 
    args1(8).Value = 1 
    args1(9).Name = "SearchItem.SearchFlags" 
    args1(9).Value = 65536 
    args1(10).Name = "SearchItem.SearchString" 
    args1(10).Value = expression 
    args1(11).Name = "SearchItem.ReplaceString" 
    args1(11).Value = replacement 
    args1(12).Name = "SearchItem.Locale" 
    args1(12).Value = 255 
    args1(13).Name = "SearchItem.ChangedChars" 
    args1(13).Value = 2 
    args1(14).Name = "SearchItem.DeletedChars" 
    args1(14).Value = 2 
    args1(15).Name = "SearchItem.InsertedChars" 
    args1(15).Value = 2 
    args1(16).Name = "SearchItem.TransliterateFlags" 
    args1(16).Value = 1280 
    args1(17).Name = "SearchItem.Command" 
    args1(17).Value = 3 
    args1(18).Name = "Quiet" 
    args1(18).Value = true 

    dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args1()) 


    end sub 

Hope this helps someone

François

1

To add a JavaScript solution:

This code places tone marks according to the official placement algorithm; see Wikipedia.

Hope that helps some of you; suggestions and improvements welcome!

var ACCENTED = {
  '1': {'a': '\u0101', 'e': '\u0113', 'i': '\u012B', 'o': '\u014D', 'u': '\u016B', 'ü': '\u01D6'},
  '2': {'a': '\u00E1', 'e': '\u00E9', 'i': '\u00ED', 'o': '\u00F3', 'u': '\u00FA', 'ü': '\u01D8'},
  '3': {'a': '\u01CE', 'e': '\u011B', 'i': '\u01D0', 'o': '\u01D2', 'u': '\u01D4', 'ü': '\u01DA'},
  '4': {'a': '\u00E0', 'e': '\u00E8', 'i': '\u00EC', 'o': '\u00F2', 'u': '\u00F9', 'ü': '\u01DC'},
  '5': {'a': 'a', 'e': 'e', 'i': 'i', 'o': 'o', 'u': 'u', 'ü': 'ü'}
};

// Returns the index of the vowel that carries the tone mark, or -1 if none.
function getPos(token) {
  var precedence = ['a', 'e', 'o'];
  for (var k = 0; k < precedence.length; k += 1) {
    var pos = token.indexOf(precedence[k]);
    // checking a before o will take care of ao automatically
    if (pos >= 0) {
      return pos;
    }
  }
  var u = token.indexOf('u');
  var i = token.indexOf('i');
  if (u >= 0 || i >= 0) {
    // -iu OR u-only case: accent goes to u
    // -ui OR i-only case: accent goes to i
    return (i < u) ? u : i;
  }
  // the only vowel left is ü
  return token.indexOf('ü');
}

// and finally:
// converts ONE numbered syllable, e.g. 'han4'; we assume the input to be
// valid lowercase Pinyin, therefore only minimal checks
function placeTone(numbered_PinYin) {
  var toneIndex = numbered_PinYin.charAt(numbered_PinYin.length - 1);
  if (!ACCENTED.hasOwnProperty(toneIndex)) {
    // no tone number at the end, nothing to do
    return numbered_PinYin;
  }
  // trim the number off
  var syllable = numbered_PinYin.slice(0, -1);
  var accentpos = getPos(syllable);
  if (accentpos < 0) {
    return syllable;
  }
  var accented_Char = ACCENTED[toneIndex][syllable.charAt(accentpos)];
  return syllable.slice(0, accentpos) + accented_Char + syllable.slice(accentpos + 1);
}

// placeTone handles one syllable at a time, so split the text first
console.log('han4 zi4'.split(' ').map(placeTone).join(' ')); // hàn zì