Answers
Try a regular expression with the 'u' modifier, for example:
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
This only works with UTF-8 encoding. – 2014-04-09 15:31:52
Agree with Petr. I tried it with BIG5 and it does not work! – 2014-11-28 05:39:17
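For input in another encoding such as BIG5, one possible workaround is to convert to UTF-8, split, and convert each character back. This is only a sketch: the helper name split_via_utf8 is made up for illustration, and it assumes mbstring is loaded and supports the encoding in question.

```php
<?php
// Hypothetical helper (not from the original answer): split a string in any
// mbstring-supported encoding by round-tripping through UTF-8.
function split_via_utf8(string $string, string $encoding): array
{
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // preg_split failed, e.g. on malformed UTF-8
    }
    // Convert every character back to the caller's encoding.
    return array_map(
        function ($char) use ($encoding) {
            return mb_convert_encoding($char, $encoding, 'UTF-8');
        },
        $chars
    );
}

// Build a BIG5 sample from two CJK characters (U+4E2D, U+6587).
$big5 = mb_convert_encoding("\u{4E2D}\u{6587}", 'BIG5', 'UTF-8');
$chars = split_via_utf8($big5, 'BIG5');
echo count($chars), PHP_EOL;
```

Each element of $chars is still BIG5-encoded, so joining them gives back the original byte string.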
An ugly way to do it is:
mb_internal_encoding("UTF-8"); // This is a must! PHP has trouble with multibyte
                               // strings when no internal encoding is set.
$string = ".....";
$chars = array();
$length = mb_strlen($string); // compute once instead of on every iteration
for ($i = 0; $i < $length; $i++) {
    $chars[] = mb_substr($string, $i, 1); // append exactly one character
}
You should also try your own approach with mb_split, setting the internal encoding beforehand.
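On PHP 7.4 or later, mbstring also provides mb_str_split, which does the same per-character split without a manual loop. A minimal sketch, assuming PHP 7.4+ with mbstring loaded; the sample string is only an illustration:

```php
<?php
// One ASCII letter, one 2-byte character (U+00E5) and one 3-byte character (U+4E3B).
$string = "a\u{00E5}\u{4E3B}";

// Split into chunks of one character each, passing the encoding explicitly
// so the result does not depend on mb_internal_encoding().
$chars = mb_str_split($string, 1, 'UTF-8');

echo count($chars), PHP_EOL;
```

Like mb_substr, this splits by code point, so a base letter and a following combining mark end up in separate elements.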
You can use the grapheme functions (PHP 5.3 or intl 1.0) and IntlBreakIterator (PHP 5.5 or intl 3.0). The following code shows the difference between the intl, mbstring, and PCRE functions.
// http://www.php.net/manual/function.grapheme-strlen.php
$string = "a\xCC\x8A"  // 'a' + COMBINING RING ABOVE (U+030A); renders like U+00E5
    ."o\xCC\x88";      // 'o' + COMBINING DIAERESIS (U+0308); renders like U+00F6
$expected = ["a\xCC\x8A", "o\xCC\x88"];             // split by grapheme cluster
$expected2 = ["a", "\xCC\x8A", "o", "\xCC\x88"];    // split by code point
var_dump(
    $expected === str_to_array($string),
    $expected === str_to_array2($string),
    $expected2 === str_to_array3($string),
    $expected2 === str_to_array4($string),
    $expected2 === str_to_array5($string)
);
function str_to_array($string)
{
    $length = grapheme_strlen($string);
    $ret = [];
    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = grapheme_substr($string, $i, 1);
    }
    return $ret;
}

function str_to_array2($string)
{
    $it = IntlBreakIterator::createCharacterInstance('en_US');
    $it->setText($string);
    $ret = [];
    $prev = 0;
    foreach ($it as $pos) {
        $char = substr($string, $prev, $pos - $prev);
        if ('' !== $char) {
            $ret[] = $char;
        }
        $prev = $pos;
    }
    return $ret;
}

function str_to_array3($string)
{
    $it = IntlBreakIterator::createCodePointInstance();
    $it->setText($string);
    $ret = [];
    $prev = 0;
    foreach ($it as $pos) {
        $char = substr($string, $prev, $pos - $prev);
        if ('' !== $char) {
            $ret[] = $char;
        }
        $prev = $pos;
    }
    return $ret;
}

function str_to_array4($string)
{
    $length = mb_strlen($string, "UTF-8");
    $ret = [];
    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = mb_substr($string, $i, 1, "UTF-8");
    }
    return $ret;
}

function str_to_array5($string)
{
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}
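If the intl extension is not available, PCRE can also split by grapheme cluster: the \X escape matches one extended grapheme cluster. The following is a sketch rather than part of the comparison above, and it assumes a PCRE build with Unicode support:

```php
<?php
// Base letters followed by combining marks (U+030A and U+0308).
$string = "a\xCC\x8A" . "o\xCC\x88";

// \X matches one extended grapheme cluster; the u modifier enables UTF-8 mode.
$count = preg_match_all('/\X/u', $string, $matches);
$clusters = $count === false ? [] : $matches[0];

// Each letter stays together with its combining mark.
var_dump($clusters === ["a\xCC\x8A", "o\xCC\x88"]);
```

This gives the grapheme-level result, whereas preg_split('//u', ...) splits the same string into four code points.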
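When working in a production environment, you need to replace invalid byte sequences with a substitute character, because almost none of the grapheme and mbstring functions can handle invalid byte sequences. If you are interested, see my earlier answer: https://stackoverflow.com/a/13695364/531320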
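One way to do that replacement with mbstring alone is to convert the string from UTF-8 to UTF-8 with the substitute character set to U+FFFD. A minimal sketch, assuming mbstring is loaded; the helper name replace_invalid_utf8 is hypothetical:

```php
<?php
// Hypothetical helper name; the approach relies on mbstring only.
function replace_invalid_utf8(string $string): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character(0xFFFD);        // use U+FFFD instead of the default "?"
    $clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the setting
    return $clean;
}

$dirty = "a\xFFb"; // 0xFF can never appear in valid UTF-8
$clean = replace_invalid_utf8($dirty);

var_dump(mb_check_encoding($dirty, 'UTF-8'));
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

After this step the string can be passed to the splitting functions without tripping over malformed input.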
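If you are not concerned about performance, you can use htmlspecialchars and htmlspecialchars_decode. The advantage of this approach is that it supports various encodings other than UTF-8.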
function str_to_array6($string, $encoding = 'UTF-8')
{
    $ret = [];
    // Collect each character through the callback; the returned string is discarded.
    str_replace_callback($string, function ($char, $index) use (&$ret) {
        $ret[] = $char;
        return '';
    }, $encoding);
    return $ret;
}

function str_replace_callback($string, $callable, $encoding = 'UTF-8')
{
    // Scrub first: replacing invalid bytes can change the byte length.
    $string = str_scrub($string, $encoding);
    $str_size = strlen($string);
    $ret = '';
    $char = '';
    $index = 0;
    for ($pos = 0; $pos < $str_size; ++$pos) {
        // Accumulate bytes until they form one valid character.
        $char .= $string[$pos];
        if (str_check_encoding($char, $encoding)) {
            $ret .= $callable($char, $index);
            $char = '';
            ++$index;
        }
    }
    return $ret;
}

function str_check_encoding($string, $encoding = 'UTF-8')
{
    $string = (string) $string;
    // ENT_QUOTES on both calls so quotes round-trip on PHP versions before 8.1,
    // where htmlspecialchars_decode defaults to ENT_COMPAT.
    return $string === htmlspecialchars_decode(htmlspecialchars($string, ENT_QUOTES, $encoding), ENT_QUOTES);
}

function str_scrub($string, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($string, ENT_SUBSTITUTE, $encoding));
}
If you want to learn the UTF-8 specification, byte manipulation is a good way to practice.
function str_to_array7($string)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $size = strlen($string);
    $ret = [];
    for ($i = 0; $i < $size; $i += 1) {
        if ($string[$i] <= "\x7F") {
            // 1-byte sequence (ASCII)
            $ret[] = $string[$i];
        } elseif ("\xC2" <= $string[$i] && $string[$i] <= "\xDF") {
            // 2-byte sequence
            if (!isset($string[$i+1])) {
                $ret[] = $substitute;
                return $ret;
            } elseif ($string[$i+1] < "\x80" || "\xBF" < $string[$i+1]) {
                $ret[] = $substitute;
            } else {
                $ret[] = substr($string, $i, 2);
                $i += 1;
            }
        } elseif ("\xE0" <= $string[$i] && $string[$i] <= "\xEF") {
            // 3-byte sequence; E0 excludes overlong forms, ED excludes surrogates
            $left = "\xE0" === $string[$i] ? "\xA0" : "\x80";
            $right = "\xED" === $string[$i] ? "\x9F" : "\xBF";
            if (!isset($string[$i+1])) {
                $ret[] = $substitute;
                return $ret;
            } elseif ($string[$i+1] < $left || $right < $string[$i+1]) {
                $ret[] = $substitute;
            } else {
                if (!isset($string[$i+2])) {
                    $ret[] = $substitute;
                    return $ret;
                } elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) {
                    $ret[] = $substitute;
                    $i += 1;
                } else {
                    $ret[] = substr($string, $i, 3);
                    $i += 2;
                }
            }
        } elseif ("\xF0" <= $string[$i] && $string[$i] <= "\xF4") {
            // 4-byte sequence; F0 excludes overlong forms, F4 caps at U+10FFFF
            $left = "\xF0" === $string[$i] ? "\x90" : "\x80";
            $right = "\xF4" === $string[$i] ? "\x8F" : "\xBF";
            if (!isset($string[$i+1])) {
                $ret[] = $substitute;
                return $ret;
            } elseif ($string[$i+1] < $left || $right < $string[$i+1]) {
                $ret[] = $substitute;
            } else {
                if (!isset($string[$i+2])) {
                    $ret[] = $substitute;
                    return $ret;
                } elseif ($string[$i+2] < "\x80" || "\xBF" < $string[$i+2]) {
                    $ret[] = $substitute;
                    $i += 1;
                } else {
                    if (!isset($string[$i+3])) {
                        $ret[] = $substitute;
                        return $ret;
                    } elseif ($string[$i+3] < "\x80" || "\xBF" < $string[$i+3]) {
                        $ret[] = $substitute;
                        $i += 2;
                    } else {
                        $ret[] = substr($string, $i, 4);
                        $i += 3;
                    }
                }
            }
        } else {
            // Continuation byte or a byte that can never start a sequence
            $ret[] = $substitute;
        }
    }
    return $ret;
}
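The lead-byte ranges that the byte-manipulation function branches on can be summarized in a small lookup. This is only an illustrative sketch; the helper utf8_sequence_length is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: map a lead byte to the expected length of its UTF-8
// sequence, or 0 if the byte cannot start a well-formed sequence.
function utf8_sequence_length(string $byte): int
{
    $code = ord($byte);
    if ($code <= 0x7F) {
        return 1; // ASCII
    }
    if ($code >= 0xC2 && $code <= 0xDF) {
        return 2;
    }
    if ($code >= 0xE0 && $code <= 0xEF) {
        return 3;
    }
    if ($code >= 0xF0 && $code <= 0xF4) {
        return 4;
    }
    // Continuation bytes (0x80-0xBF) and never-valid leads (0xC0, 0xC1, 0xF5-0xFF).
    return 0;
}

foreach (["a", "\xC3", "\xE4", "\xF0", "\x80", "\xC0"] as $byte) {
    printf("0x%02X => %d\n", ord($byte), utf8_sequence_length($byte));
}
```

The lookup only classifies the first byte; the range checks on the following bytes (for example A0-BF after E0) are still needed to reject overlong forms and surrogates.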
The benchmark results for these functions are as follows (elapsed seconds for 10,000 calls each).
grapheme
0.12967610359192
IntlBreakIterator::createCharacterInstance
0.17032408714294
IntlBreakIterator::createCodePointInstance
0.079245090484619
mbstring
0.081080913543701
preg_split
0.043133974075317
htmlspecialchars
0.25599694252014
byte manipulation
0.13132810592651
The benchmark code is as follows.
$string = '主樓怎麼走';
foreach (timer([
    'grapheme' => 'str_to_array',
    'IntlBreakIterator::createCharacterInstance' => 'str_to_array2',
    'IntlBreakIterator::createCodePointInstance' => 'str_to_array3',
    'mbstring' => 'str_to_array4',
    'preg_split' => 'str_to_array5',
    'htmlspecialchars' => 'str_to_array6',
    'byte manipulation' => 'str_to_array7'
], [$string]) as $desc => $time) {
    echo $desc, PHP_EOL,
        $time, PHP_EOL;
}

// Runs each callable $repeat times and returns the elapsed seconds per label.
function timer(array $callables, array $arguments, $repeat = 10000)
{
    $ret = [];
    $save = $repeat;
    foreach ($callables as $key => $callable) {
        $start = microtime(true);
        do {
            array_map($callable, $arguments);
        } while ($repeat -= 1);
        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;
    }
    return $ret;
}
Check out http://stackoverflow.com/questions/1032674/string-to-array-and-back-php – Smandoli 2010-03-31 20:37:58
Note that this is a multibyte string. – Peterim 2010-03-31 20:46:18