Java - Unable to read special characters correctly in BufferedReader

I have written code that reads data from a CSV file. However, I am unable to handle special characters such as £.

For example, My Base Cost (K£) is being read as My Base Cost (KÃ‚Â£).

What can I do to correct this?

public void parseCSVFile(String filename){ 

    try { 
      br = new BufferedReader(new FileReader(csvDirectory + filename)); 

      while ((parsedLines = br.readLine()) != null) { 

       String[] parsedData = parsedLines.split(csvSplitByComma); 

       entireFeed.add(parsedData[0]); 
       entireFeed.add(parsedData[1]); 

       System.out.println(parsedData[0]); 
       System.out.println(parsedData[1]); 

       it = entireFeed.iterator(); 
      } 
     } catch (Exception e) { 
      e.printStackTrace(); 
     } 
}

Source

2016-11-16 NSC

Possible duplicate of http://stackoverflow.com/questions/9281629/read-special-characters-in-java-with-bufferedreader

@NiranjanKumar I tried the following and it still does not work. I get back "My Base Cost (KÃƒÂ‚Ã‚Â£)": BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "ISO-8859-1")); – NSC

Possible duplicate of [Read/write .txt file with special characters](http://stackoverflow.com/questions/4597749/read-write-txt-file-with-special-characters)

The code that wrote your CSV file is broken. It has triple-encoded the text as UTF-8.

In UTF-8, ASCII characters (code points 0-127) are represented as single bytes; they need no multi-byte encoding. That is why only the £ is affected.

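£ requires two bytes in UTF-8. Those bytes are 0xc2, 0xa3. If the code that wrote the CSV file had used UTF-8 correctly, the character would appear in the file as exactly those two bytes.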

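But apparently, some code somewhere read the file using a one-byte charset (such as ISO-8859-1), which caused each individual byte to be treated as its own character. It then encoded those individual characters using UTF-8. That is, it took the bytes {0xc2, 0xa3}, treated each one as a character, and encoded each of those characters as UTF-8. That in turn produced these bytes: 0xc3, 0x82, 0xc2, 0xa3. (Specifically: the U+00C2 character is represented in UTF-8 as 0xc3 0x82, and the U+00A3 character is represented in UTF-8 as 0xc2 0xa3.)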

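Then, at some point after that, the same thing was done again. Those four bytes were read using a one-byte charset, each byte was treated as its own character, and each of those four characters was encoded in UTF-8, resulting in eight bytes: 0xc3, 0x83, 0xc2, 0x82, 0xc3, 0x82, 0xc2, 0xa3. (Not every character turns into two bytes when encoded as UTF-8; it just happens that all of these characters do.)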
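The two rounds of mis-encoding described above can be reproduced in a few lines. This is only a sketch of the presumed mistake, not the actual code that wrote the file; the class and method names (MojibakeDemo, misEncodeOnce, hex) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // One round of the mistake: decode UTF-8 bytes as ISO-8859-1
    // (one char per byte), then encode those chars as UTF-8 again.
    static byte[] misEncodeOnce(byte[] utf8Bytes) {
        String wronglyDecoded = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        return wronglyDecoded.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign; the escape avoids source-encoding problems.
        byte[] correct = "\u00A3".getBytes(StandardCharsets.UTF_8);
        byte[] twice = misEncodeOnce(correct);
        byte[] thrice = misEncodeOnce(twice);

        System.out.println(hex(correct)); // c2 a3
        System.out.println(hex(twice));   // c3 82 c2 a3
        System.out.println(hex(thrice));  // c3 83 c2 82 c3 82 c2 a3
    }
}
```

The last line matches the eight bytes found in the file, which is what suggests the text went through UTF-8 encoding three times in total.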

That is why, when you read the file using the ISO-8859-1 charset, you get one character for each byte:

Ã  ƒ  Â  ‚  Ã  ‚  Â  £
c3 83 c2 82 c3 82 c2 a3

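(Technically, ‚ is actually U+201A "SINGLE LOW-9 QUOTATION MARK", but many one-byte-per-character Windows character sets, such as windows-1252, have historically placed that character at position 0x82.)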
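To see the difference for byte 0x82, here is a small sketch that decodes it with both charsets. It assumes the JDK provides the windows-1252 charset, which is common but not guaranteed by the Java specification; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Byte82Demo {

    // Decode a single byte with the given charset and
    // return the resulting code point in "U+XXXX" form.
    static String decodeToCodePoint(byte b, Charset charset) {
        String s = new String(new byte[] { b }, charset);
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Assumption: this charset is available in the running JDK.
        Charset windows1252 = Charset.forName("windows-1252");

        // ISO-8859-1 maps 0x82 to an invisible control character.
        System.out.println(decodeToCodePoint((byte) 0x82, StandardCharsets.ISO_8859_1)); // U+0082

        // windows-1252 maps 0x82 to the low-9 quotation mark.
        System.out.println(decodeToCodePoint((byte) 0x82, windows1252)); // U+201A
    }
}
```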

So, now that we know how your file got this way, what do you do about it?

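First, don't make it any worse. If you control the code that writes the file, make sure that code explicitly specifies a charset for both reading and writing. UTF-8 is almost always the best choice, at least for any file that primarily uses Western characters.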
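As a rough illustration of specifying the charset on both sides, the sketch below writes and reads a line through Files.newBufferedWriter and Files.newBufferedReader with StandardCharsets.UTF_8. The class name, helper methods, and temporary file are assumptions made for the example, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetDemo {

    // Write one line, naming the charset instead of relying on the platform default.
    static void writeLine(Path file, String line) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write(line);
            writer.newLine();
        }
    }

    // Read the first line back with the same explicit charset.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("feed", ".csv"); // hypothetical file
        try {
            String header = "My Base Cost (K\u00A3)";
            writeLine(file, header);
            // Compare in memory rather than printing, so console encoding does not matter.
            System.out.println(header.equals(readFirstLine(file))); // true
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

By contrast, the FileReader(String) constructor used in the question takes no charset argument and falls back to a default charset.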

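Second, how do you fix the file? I'm afraid there is no way to automatically detect this kind of mis-encoding, but at least in the case of this file, you can triple-decode it.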

If the file is not very large, you can read it all into memory:

byte[] bytes = Files.readAllBytes(Paths.get(csvDirectory, filename)); 
// First decoding: £ is represented as four characters 
String content = new String(bytes, "UTF-8"); 

bytes = new byte[content.length()]; 
for (int i = content.length() - 1; i >= 0; i--) { 
    bytes[i] = (byte) content.charAt(i); 
} 
// Second decoding: £ is represented as two characters 
content = new String(bytes, "UTF-8"); 

bytes = new byte[content.length()]; 
for (int i = content.length() - 1; i >= 0; i--) { 
    bytes[i] = (byte) content.charAt(i); 
} 
// Third decoding: £ is represented as one character 
content = new String(bytes, "UTF-8"); 

br = new BufferedReader(new StringReader(content)); 

// ...
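The char-to-byte loops above could also be written with String.getBytes and ISO-8859-1, which maps each character up to U+00FF to the single byte of the same value. The following is a sketch of that variant with hypothetical class and method names. One difference to be aware of: getBytes replaces any character above U+00FF with '?', whereas the cast in the loops keeps only the low eight bits.

```java
import java.nio.charset.StandardCharsets;

public class TripleDecodeDemo {

    // Undo one mis-encoding round: decode as UTF-8, then map each
    // resulting char (expected to be <= U+00FF) back to a single byte.
    static byte[] undoOneRound(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Three UTF-8 decodings in total, as in the loops above.
    static String tripleDecode(byte[] fileBytes) {
        byte[] bytes = undoOneRound(fileBytes); // £: eight bytes -> four
        bytes = undoOneRound(bytes);            // £: four bytes -> two
        return new String(bytes, StandardCharsets.UTF_8); // £: two bytes -> one char
    }

    public static void main(String[] args) {
        // The eight bytes that stand for £ in the broken file.
        byte[] broken = {
            (byte) 0xc3, (byte) 0x83, (byte) 0xc2, (byte) 0x82,
            (byte) 0xc3, (byte) 0x82, (byte) 0xc2, (byte) 0xa3
        };
        String fixed = tripleDecode(broken);
        System.out.println(fixed.equals("\u00A3")); // true
    }
}
```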

If it is a large file, you will want to read each line as bytes:

try (InputStream in = new BufferedInputStream(
    Files.newInputStream(Paths.get(csvDirectory, filename)))) { 

    ByteBuffer lineBuffer = ByteBuffer.allocate(64 * 1024); 

    int b = 0; 
    while (b >= 0) { 
     lineBuffer.clear(); 

     for (b = in.read(); 
      b >= 0 && b != '\n' && b != '\r'; 
      b = in.read()) { 

      lineBuffer.put((byte) b); 
     } 

     if (b == '\r') { 
      in.mark(1); 
      if (in.read() != '\n') { 
       in.reset(); 
      } 
     } 

     lineBuffer.flip(); 
     byte[] bytes = new byte[lineBuffer.limit()]; 
     lineBuffer.get(bytes); 

     // First decoding: £ is represented as four characters 
     String parsedLine = new String(bytes, "UTF-8"); 

     bytes = new byte[parsedLine.length()]; 
     for (int i = parsedLine.length() - 1; i >= 0; i--) { 
      bytes[i] = (byte) parsedLine.charAt(i); 
     } 
     // Second decoding: £ is represented as two characters 
     parsedLine = new String(bytes, "UTF-8"); 

     bytes = new byte[parsedLine.length()]; 
     for (int i = parsedLine.length() - 1; i >= 0; i--) { 
      bytes[i] = (byte) parsedLine.charAt(i); 
     } 
     // Third decoding: £ is represented as one character 
     parsedLine = new String(bytes, "UTF-8"); 

     // ... 
    } 
}

Source

2016-11-16 18:45:46 VGR

Thanks for the explanation; it makes sense where I was going wrong. I have corrected my code and it now works as expected. – NSC

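This looks like an encoding problem. Find out which charset your file is encoded in. Assuming the encoding is UTF-8, you can do something like this: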

new BufferedReader(new InputStreamReader(new FileInputStream("my/path/to/File"), "UTF-8"));

This should solve your problem.

Source

2016-11-16 14:39:33
