Identifying and matching non-ASCII characters in a file

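I am trying to read a delimited file and parse its contents. Unlike CSV, the delimiter, string qualifier, etc. are non-ASCII, i.e. U+0014 and U+00FE. However, I cannot detect the string qualifier (FE). Is this because the character's value is larger, or is it something else?

Below is a simple program that illustrates the core problem. How do I make this work? Here is a link to a very small test file: https://www.dropbox.com/s/1cilircwc3pq78c/nonascii.dat?dl=0

Thanks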

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

import java.io.File;
import java.io.PrintStream;

public class CharMatch {
    public static void main(String[] args) throws Exception {
        final String pathname = "/home/vinayb/Downloads/nonascii.dat";
        final File file = new File(pathname);
        final String encoding = "UTF-8";
        final PrintStream out = new PrintStream(System.out, true, encoding);

        final LineIterator it = FileUtils.lineIterator(file, encoding);
        try {
            // read a line
            final String line = it.nextLine();
            // print every char with its code value; (int) c is the UTF-16 code unit,
            // whereas Character.getNumericValue(c) would return the digit value (or -1)
            for (char c : line.toCharArray()) {
                out.println(c + " , with decimal value: " + (int) c
                        + " and hexa value: " + Integer.toHexString(c));
            }

            out.println("------------------------------------");
            final String expectedDelimiter = fromUnicode("0014");
            final String expectedStringQualifier = fromUnicode("00FE");
            out.println("##### expected delimiter:" + expectedDelimiter);
            out.println("##### expected string qualifier:" + expectedStringQualifier);

            String[] items = line.split(expectedDelimiter);
            out.println("#### " + items.length + " " + items[0]);

            if (line.contains(expectedDelimiter)) {
                out.println("Found delimiter"); //=======> can match this
            }

            if (line.contains(expectedStringQualifier)) {
                out.println("Found string qualifier"); //=======> can't match this
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
    }

    // converts a hex code point such as "00FE" into a one-character string
    private static String fromUnicode(String codePoint) {
        return "" + (char) Integer.parseInt(codePoint, 16);
    }
}

Source

2015-04-01 Vinay B

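"String qualifier character"? What is that supposed to be? – fge 2015-04-01 21:35:01

It is a character used to qualify (enclose) a string. A commonly used qualifier is the double quote ("). For example, in a CSV file we would use a comma as the delimiter, giving "John Doe","123, Main Street". In this case the qualifier is 00FE. See this link for what it looks like: http://en.wikipedia.org/wiki/ISO/IEC_8859-1 – 2015-04-01 21:40:57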

Your file is not valid UTF-8:

$ iconv -f utf-8 *dat >/dev/null; echo $? 
iconv: illegal input sequence at position 0 
1

But it can be "read" as ISO-8859-1:

$ iconv -f iso-8859-1 *dat >/dev/null; echo $? 
0

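Just change the encoding to that one; but such a file format is rather strange in 2015. What you really should do is ask the source of these files to move with the times ;)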
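As a rough illustration of that change, here is a minimal sketch that decodes the same bytes once as UTF-8 and once as ISO-8859-1. The class name `Latin1Match`, the helper `parseLine`, and the in-memory sample line are all made up for this example (the real file is not reproduced here), and the parsing is deliberately naive: it assumes the delimiter never occurs inside a qualified field.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class Latin1Match {
    static final String DELIMITER = "\u0014"; // U+0014, field delimiter
    static final String QUALIFIER = "\u00FE"; // U+00FE, string qualifier

    // Decodes one line with the given charset, splits it on the delimiter
    // and strips a surrounding pair of qualifiers from each field.
    // Simplification: a delimiter inside a qualified field is not handled.
    static List<String> parseLine(byte[] raw, Charset cs) {
        String line = new String(raw, cs);
        List<String> fields = new ArrayList<>();
        for (String item : line.split(Pattern.quote(DELIMITER), -1)) {
            if (item.length() >= 2 && item.startsWith(QUALIFIER) && item.endsWith(QUALIFIER)) {
                item = item.substring(1, item.length() - 1);
            }
            fields.add(item);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample line: two qualified fields, one byte per character.
        byte[] raw = "\u00FEJohn\u00FE\u0014\u00FEDoe\u00FE"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Decoded as UTF-8, each 0xFE byte is malformed and becomes U+FFFD.
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);
        System.out.println(asUtf8.contains(QUALIFIER));   // false

        // Decoded as ISO-8859-1, each byte maps to the code point of the same value.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.contains(QUALIFIER)); // true

        System.out.println(parseLine(raw, StandardCharsets.ISO_8859_1)); // [John, Doe]
    }
}
```

In the program from the question, the equivalent change would presumably be setting `encoding` to "ISO-8859-1" for the `lineIterator` call.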

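Note that because the first byte sequence is invalid, Java by default replaces it with U+FFFD, and it does the same for every byte sequence that it cannot convert to chars. To make Java throw an exception in this case instead, you need to instantiate a CharsetDecoder (from a Charset instance) and specify .onMalformedInput(CodingErrorAction.REPORT); a reader created from a charset name uses CodingErrorAction.REPLACE.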
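A minimal sketch of such a strict decoder follows. The class name `StrictDecode` and its two helper methods are invented for this example; the `CharsetDecoder`, `CodingErrorAction` and `ByteBuffer` calls are standard JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Decodes the bytes as UTF-8 and throws instead of substituting U+FFFD.
    static String decodeStrict(byte[] raw) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(raw)).toString();
    }

    // Convenience wrapper: true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        try {
            decodeStrict(raw);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] loneFe = {(byte) 0xFE};                // never valid in UTF-8
        byte[] encodedThorn = {(byte) 0xC3, (byte) 0xBE}; // U+00FE encoded as UTF-8

        System.out.println(isValidUtf8(loneFe));       // false
        System.out.println(isValidUtf8(encodedThorn)); // true
    }
}
```

For a file, the same decoder can be passed to the `InputStreamReader(InputStream, CharsetDecoder)` constructor, so that reading fails with an exception at the first malformed byte rather than silently producing U+FFFD.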

Source

2015-04-01 21:40:17 fge

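I used the ISO-8859-1 encoding, which let me read the file – 2015-04-07 22:06:57

Take a look here. 00 FE may be the correct code in UTF-16, but in UTF-8 it is C3 BE. That would also explain why the file is not valid UTF-8.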
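The byte values mentioned above can be checked with a few lines of Java. This is only a sketch; the class name `ThornBytes` and the `hex` helper are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ThornBytes {
    // Encodes the string with the given charset and renders the bytes as hex pairs.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String qualifier = "\u00FE"; // the string qualifier, U+00FE

        System.out.println("ISO-8859-1: " + hex(qualifier, StandardCharsets.ISO_8859_1)); // FE
        System.out.println("UTF-8:      " + hex(qualifier, StandardCharsets.UTF_8));      // C3 BE
        System.out.println("UTF-16BE:   " + hex(qualifier, StandardCharsets.UTF_16BE));   // 00 FE
    }
}
```

A file that stores the qualifier as the single byte FE therefore matches ISO-8859-1, not UTF-8.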

Source

2015-04-01 21:42:17
