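The class should check currentFile and detect its encoding. If the result is UTF-8, it returns true.
The UTF-8 detection throws an OutOfMemoryError. After running, the output is: java.lang.OutOfMemoryError: Java heap space
Reading the data requires JDK 7, because of Files.readAllBytes(path).
Code: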
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class EncodingsCheck implements Checker {

    @Override
    public boolean check(File currentFile) {
        return isUTF8(currentFile);
    }

    public static boolean isUTF8(File file) {
        // validate input
        if (null == file) {
            throw new IllegalArgumentException("input file can't be null");
        }
        if (file.isDirectory()) {
            throw new IllegalArgumentException(
                    "input file refers to a directory");
        }
        // read input file
        byte[] buffer;
        try {
            buffer = readUTFHeaderBytes(file);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Can't read input file, error = " + e.getLocalizedMessage());
        }
        // Only the first character is inspected: a heuristic, not a full validation.
        if (buffer.length == 0) {
            return false; // empty file: no bytes to inspect, treated here as "not detected"
        }
        if (0 == (buffer[0] & 0x80)) {
            return true; // ASCII subset character, fast path
        } else if (0xF0 == (buffer[0] & 0xF8)) { // start of 4-byte sequence
            if (buffer.length < 4) {
                return false; // sequence is truncated
            }
            if ((0x80 == (buffer[1] & 0xC0)) && (0x80 == (buffer[2] & 0xC0))
                    && (0x80 == (buffer[3] & 0xC0))) {
                return true;
            }
        } else if (0xE0 == (buffer[0] & 0xF0)) { // start of 3-byte sequence
            if (buffer.length < 3) {
                return false; // sequence is truncated
            }
            if ((0x80 == (buffer[1] & 0xC0)) && (0x80 == (buffer[2] & 0xC0))) {
                return true;
            }
        } else if (0xC0 == (buffer[0] & 0xE0)) { // start of 2-byte sequence
            if (buffer.length < 2) {
                return false; // sequence is truncated
            }
            if (0x80 == (buffer[1] & 0xC0)) {
                return true;
            }
        }
        return false;
    }

    private static byte[] readUTFHeaderBytes(File input) throws IOException {
        // read data; Files.readAllBytes (JDK 7+) loads the entire file into memory
        Path path = Paths.get(input.getAbsolutePath());
        byte[] data = Files.readAllBytes(path);
        return data;
    }
}
Questions:
- How can I solve this problem?
- How can UTF-16 be checked in the same way (do we need to worry about it, or is it just pointless trouble)?
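On the UTF-16 point, one common starting place is the byte order mark (BOM): a UTF-16 file often, though not always, begins with the bytes FE FF (big-endian) or FF FE (little-endian), and a UTF-8 file may begin with EF BB BF. The following is only a sketch of that idea. The class name BomSniffer and the method detectBom are made up for illustration and are not part of the code above, and a file without a BOM is simply reported as unknown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original EncodingsCheck.
class BomSniffer {

    /**
     * Returns the charset suggested by a leading byte order mark,
     * or null when the header carries no recognised BOM.
     */
    static Charset detectBom(byte[] header) {
        if (header.length >= 3
                && (header[0] & 0xFF) == 0xEF
                && (header[1] & 0xFF) == 0xBB
                && (header[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (header.length >= 2
                && (header[0] & 0xFF) == 0xFE
                && (header[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // Caveat: FF FE followed by 00 00 is also the UTF-32LE BOM;
        // this sketch does not try to tell the two apart.
        if (header.length >= 2
                && (header[0] & 0xFF) == 0xFF
                && (header[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding cannot be decided this way
    }
}
```

The `& 0xFF` masks are needed because Java bytes are signed, so a raw comparison such as `header[0] == 0xFE` would never be true.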
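If you only need the first four bytes of the file to detect the header, why are you reading the entire file into memory? Think about what would happen if this were a 1GB file. – mellamokb 2013-03-10 19:15:58
@mellamokb How can I avoid this resource-expensive process? – 2013-03-10 19:22:56
@nazar_art Search for "Java read file tutorial". Find one that discusses 'InputStream'. – 2013-03-10 19:24:46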
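For what the comments are pointing at, here is a rough sketch of reading only the first few bytes through an InputStream instead of loading the whole file. The class name HeaderReader, the method readHeaderBytes and the maxBytes parameter are assumptions made for this example and do not appear in the question. It still relies on JDK 7 for Files.newInputStream and try-with-resources.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original EncodingsCheck.
class HeaderReader {

    /** Reads at most maxBytes from the start of the file and nothing beyond that. */
    static byte[] readHeaderBytes(File input, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        try (InputStream in = Files.newInputStream(input.toPath())) {
            // read() may return fewer bytes than requested, so loop until
            // the buffer is full or the end of the file is reached.
            while (total < maxBytes) {
                int n = in.read(buffer, total, maxBytes - total);
                if (n < 0) {
                    break; // end of file
                }
                total += n;
            }
        }
        // Trim so the array length equals the number of bytes actually read.
        return Arrays.copyOf(buffer, total);
    }
}
```

With something like `HeaderReader.readHeaderBytes(file, 4)`, memory use stays constant regardless of file size, and the returned array is shorter than 4 bytes for very small files, so callers should check its length before indexing.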