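grep does not work with French characters when called from Java

I call grep from Java to count, separately for each word in a word list, how often it occurs in a corpus. When called from Java, grep does not work with French characters.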

BufferedReader fb = new BufferedReader(
    new InputStreamReader(
        new FileInputStream("french.txt"), "UTF8"));

String l;
while ((l = fb.readLine()) != null) {
    // Wrap the word in word boundaries; grep -c counts matching lines, -i ignores case.
    String lpt = "\\b" + l + "\\b";
    String[] args = new String[]{"grep", "-ic", lpt, corpus};
    Process grepCommand = Runtime.getRuntime().exec(args);
    BufferedReader grepInput = new BufferedReader(new InputStreamReader(grepCommand.getInputStream()));
    int tmp = Integer.parseInt(grepInput.readLine());
    grepCommand.waitFor();
    System.out.println(l + "\t" + tmp);
}

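This works for my English word list and corpus. But I also have a French word list and corpus, and it does not work for French. Sample output on the Java console looks like this: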

� bord  0 
� c�t�  0

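The correct forms are "à bord" and "à côté".

Now my question is: where is the problem? Should I fix my Java code, or is it a grep problem? If so, how do I fix it? (Even when I change the encoding to UTF-8, I cannot see the French characters correctly in the terminal.)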

2013-04-07 MAZDAK

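Why not use the Java regex engine? – 2013-04-07 11:38:32

Are you sure your file is encoded in UTF-8? It is more likely ISO-8859-1 or ISO-8859-15 or something similar. – 2013-04-07 11:38:41

I suggest you read the file line by line and then call split on word boundaries to get the word count.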

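import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public static void main(String[] args) throws IOException {
    final File file = new File("myFile");
    // Decode the file explicitly as UTF-8 instead of relying on the platform default.
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Splitting on \b also yields the whitespace and punctuation between words,
            // so words.length is an upper bound on the number of words.
            final String[] words = line.split("\\b");
            System.out.println(words.length + " words in line \"" + line + "\".");
        }
    }
}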

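This way you avoid calling grep from your program.

The strange characters you get are most likely caused by using the wrong encoding. Are you sure your file is in UTF-8?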
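One way to answer the encoding question is to decode the file strictly and see whether it fails. The following is a minimal sketch, not part of the original answer: the class name `Utf8Check` and the method `isValidUtf8` are made up for illustration, and `french.txt` is the word list named in the question. A file that passes is not guaranteed to be UTF-8 by intent, but a Latin-1 file containing accented letters will usually fail.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the word list from the question; pass another path as args[0].
        Path path = Paths.get(args.length > 0 ? args[0] : "french.txt");
        System.out.println(path + " is valid UTF-8: " + isValidUtf8(Files.readAllBytes(path)));
    }
}
```

If the check fails, reading the file with "ISO-8859-1" or "ISO-8859-15" instead of "UTF-8", as suggested in the comments, is the next thing to try.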

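Edit

The OP wants to read one file line by line and then search another file for occurrences of each line read.

This can still be done more easily in Java. Depending on how large your other file is, you can either read it into memory first and search it, or search it line by line as well.

A simple example that reads the file into memory: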

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) throws IOException {
    final File corpusFile = new File("corpus");
    final String corpusFileContent = readFileToString(corpusFile);
    final File file = new File("myEngramFile");
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            final int matches = countOccurencesOf(line, corpusFileContent);
            System.out.println(line + "\t" + matches);
        }
    }
}

// Reads the whole file and decodes it as UTF-8 in one step, so a multi-byte
// character can never be split across two read buffers.
private static String readFileToString(final File file) throws IOException {
    return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
}

// Counts the whole-word occurrences of countMatchesOf in inString.
private static int countOccurencesOf(final String countMatchesOf, final String inString) {
    final Matcher matcher = Pattern.compile("\\b" + countMatchesOf + "\\b").matcher(inString);
    int count = 0;
    while (matcher.find()) {
        ++count;
    }
    return count;
}

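This should work fine if your "corpus" file is smaller than a hundred megabytes or so. For anything larger, you will want to change the countOccurencesOf method to something like this: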

// Streams the corpus line by line instead of holding it in memory.
private static int countOccurencesOf(final String countMatchesOf, final File inFile) throws IOException {
    // Compile the pattern once and reuse it for every line.
    final Pattern pattern = Pattern.compile("\\b" + countMatchesOf + "\\b");
    int count = 0;
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            final Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                ++count;
            }
        }
    }
    return count;
}

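Now you just pass your File object into the method instead of the file's contents as a String.

Note that the streaming approach reads the file line by line and therefore discards the line breaks. If your Pattern depends on them, you need to add them back before matching the String.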
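The patterns above concatenate the search term into the regex as-is and use Java's default matching rules. For French n-grams it may be worth being explicit, since how `\b` treats accented letters without extra flags has differed between Java versions. The sketch below is an assumption-laden variant, not the answer's code: the class name `UnicodeWordCount` and the method `countOccurrences` are hypothetical, `Pattern.quote` keeps regex metacharacters in the term literal, and the case-insensitive flags mimic the `-i` option of the original grep call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordCount {

    // Counts whole-word, case-insensitive occurrences of term in text,
    // using Unicode-aware word boundaries and case folding.
    static int countOccurrences(String term, String text) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            ++count;
        }
        return count;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String text = "Le C\u00d4T\u00c9 droit, \u00e0 c\u00f4t\u00e9 du port.";
        System.out.println(countOccurrences("c\u00f4t\u00e9", text));
    }
}
```

The same flags can be passed to `Pattern.compile` in either version of countOccurencesOf above.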

2013-04-07 16:41:39

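What I need is the number of occurrences in a corpus of any given n-gram read from another file (fb). You are right, the strange characters are due to the file encoding. – MAZDAK 2013-04-08 11:18:32

The problem is in your design. Do not call grep from Java. Use a pure Java implementation instead: read the file line by line and implement your own "grep" with the pure Java API.

But seriously, I think the problem is in your shell. Have you tried running grep manually and filtering for French characters? I believe it will not work for you there either. It depends on your shell configuration and is therefore platform dependent. Java can provide a platform-independent solution. To achieve that, you should avoid non-pure-Java techniques, including executing command-line tools, as much as possible.

BTW, code that reads your file line by line and filters the lines using String.contains() or pattern matching is even shorter than the code that runs grep.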
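If grep has to stay, for example for speed on a very large corpus, the locale dependence described above can be made explicit. The sketch below is one possible way to do that, under stated assumptions: the class `GrepCount` and its methods are hypothetical, the locale name `en_US.UTF-8` is a guess that must exist on the machine (`locale -a` lists the installed ones), and the corpus is assumed to be UTF-8. It sets `LC_ALL` for the child process, decodes grep's output as UTF-8, and prints through a UTF-8 `PrintStream`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class GrepCount {

    // Assumed locale name; replace it with a UTF-8 locale installed on your system.
    static final String UTF8_LOCALE = "en_US.UTF-8";

    // Builds the grep call from the question and forces a UTF-8 locale for the child process.
    static ProcessBuilder grepCommand(String word, String corpusPath) {
        ProcessBuilder builder =
                new ProcessBuilder("grep", "-ic", "\\b" + word + "\\b", corpusPath);
        builder.environment().put("LC_ALL", UTF8_LOCALE);
        // Let grep's error messages appear on the console instead of being lost.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        return builder;
    }

    // Runs grep and returns the number of matching lines it reports.
    static int countMatchingLines(String word, String corpusPath)
            throws IOException, InterruptedException {
        Process process = grepCommand(word, corpusPath).start();
        String firstLine;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = reader.readLine();
        }
        process.waitFor();
        if (firstLine == null) {
            throw new IOException("grep produced no output for: " + word);
        }
        return Integer.parseInt(firstLine.trim());
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the word, args[1] the corpus path.
        // Encode console output as UTF-8 explicitly instead of using the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(args[0] + "\t" + countMatchingLines(args[0], args[1]));
    }
}
```

Two caveats apply. The terminal itself must also be configured to display UTF-8, and the JVM should be started under a UTF-8 locale as well, because on Unix-like systems that locale typically determines how the non-ASCII search term is encoded when it is passed to grep as an argument.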

2013-04-07 11:43:04 AlexR

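I agree, maybe not String.contains(), but I think pattern matching is a good idea. Calling grep takes a lot of time, so it might even be faster. However, I still have the same problem when displaying the results on the Java console. – MAZDAK 2013-04-07 16:43:33

It turns out that implementing the whole thing in Java is actually much slower on my huge corpus. – MAZDAK 2013-04-08 11:15:02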
