Extracting double-byte characters/substrings from a UTF-8 formatted string

I am trying to extract emojis and other special characters from a string for further processing (for example, the string contains '😅' as one of its characters).

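But neither string.charAt(i) nor string.substring(i, i+1) works for me. The original string is in UTF-8 format, which means the escaped form of the emoji above is encoded as '\uD83D\uDE05'. That is why I get '?' (\uD83D) and '?' (\uDE05) instead of the emoji at that position: it occupies two positions when I iterate over the string.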
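A minimal sketch of the behaviour described above, assuming the test string is "a" followed by the escaped pair \uD83D\uDE05. The class name `SurrogateDemo` and the helper `describeCodeUnits` are made up for illustration. It only uses `charAt`, `Character.isHighSurrogate` and `Character.isLowSurrogate` to show that `charAt(i)` hands back the two halves of a surrogate pair one at a time:

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateDemo {
    // Hypothetical helper: describes each UTF-16 code unit exactly as charAt(i) returns it.
    static List<String> describeCodeUnits(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String kind = Character.isHighSurrogate(c) ? "high surrogate"
                        : Character.isLowSurrogate(c) ? "low surrogate"
                        : "ordinary char";
            out.add(String.format("U+%04X %s", (int) c, kind));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE05";
        System.out.println(s.length()); // 3: one char for 'a', two for the emoji
        for (String line : describeCodeUnits(s)) {
            System.out.println(line);
        }
    }
}
```

Neither half of the pair is a printable character on its own, which would explain the '?' seen at both positions.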

Does anyone have a solution to this problem?

Source

2015-06-14 conidium

For UTF-16 encoding, use `str.getBytes("UTF-16");` – Cyrbil

You need to use **code points** rather than `char`s. Emoji don't fit in a 16-bit `char`. See [How does Java 16 bit chars support Unicode?](http://stackoverflow.com/questions/1941613/how-does-java-16-bit-chars-support-unicode) and [How can I iterate through the Unicode codepoints of a Java String?](http://stackoverflow.com/questions/1527856/how-can-i-iterate-through-the-unicode-codepoints-of-a-java-string). –
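To make the `char` versus code point distinction concrete, here is a rough sketch that cuts a string into one substring per code point. `CodePointSplit` and `splitByCodePoint` are hypothetical names; the calls used (`codePointAt`, `Character.charCount`, `codePointCount`) are standard `String` and `Character` methods.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplit {
    // Hypothetical helper: returns one substring per Unicode code point.
    static List<String> splitByCodePoint(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);           // reads a full surrogate pair if one starts here
            int len = Character.charCount(cp);   // 1 for BMP code points, 2 for supplementary ones
            parts.add(s.substring(i, i + len));
            i += len;
        }
        return parts;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE05b";
        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(splitByCodePoint(s).size());      // 3 substrings
    }
}
```

Note that one code point is not always one visible symbol: some emoji are sequences of several code points (for example with joiners or variation selectors), so this only fixes the surrogate-pair case from the question.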

@cyrbil How would that help? –

Thanks to John Kugelman for the help. The solution now looks like this:

for (int codePoint : codePoints(string)) {
    // Turn the code point back into its UTF-16 form: one char, or a surrogate pair
    char[] chars = Character.toChars(codePoint);
    System.out.println(codePoint + " : " + String.copyValueOf(chars));
}

with the codePoints(String string) method looking like this:

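import java.util.Iterator;
import java.util.NoSuchElementException;

private static Iterable<Integer> codePoints(final String string) {
    return new Iterable<Integer>() {
        @Override
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                // Index of the next UTF-16 code unit to read
                int nextIndex = 0;

                @Override
                public boolean hasNext() {
                    return nextIndex < string.length();
                }

                @Override
                public Integer next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    int result = string.codePointAt(nextIndex);
                    // Advance by 1 char for BMP code points, 2 for surrogate pairs
                    nextIndex += Character.charCount(result);
                    return result;
                }

                @Override
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}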
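On Java 8 or later, the hand-written iterator is not strictly needed, since `String` exposes `codePoints()`, which returns an `IntStream`. The sketch below is not part of the original solution; `CodePointStream` and `supplementaryCodePoints` are hypothetical names. It keeps only code points outside the Basic Multilingual Plane, on the assumption that this is a good enough first filter for the emoji in question:

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointStream {
    // Hypothetical helper: returns each supplementary code point (above U+FFFF) as its own string.
    static List<String> supplementaryCodePoints(String s) {
        return s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> found = supplementaryCodePoints("a\uD83D\uDE05b");
        System.out.println(found.size()); // 1: only the emoji is outside the BMP
    }
}
```

This filter is only approximate: some emoji live inside the BMP and would be skipped, and not every supplementary code point is an emoji.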

Source

2015-06-15 06:24:24 conidium
