2017-02-17

Removing comments from large files using Java

I have .sh, .txt, .sql, .pkb and similar files that are larger than 10 MB, which means more than 100,000 lines.

I want to remove the comments from these files and then work with the uncommented content. I wrote the code below for that.

/** 
* Removes all the commented part from the file content as well as returns a 
* file structure which have just lines with declaration syntax for eg. 
* Create Package packageName <- Stores all declaration lines as separate 
* string in an array 
* 
* @param file 
* @return file content 
* @throws IOException 
*/ 
private static String[] filterContent(File file) throws IOException { 

    String withoutComment = ""; 
    String declare = ""; 
    String[] content; 
    List<String> readLines = FileUtils.readLines(file); 

    int size = readLines.size(); 
    System.out.println(file.getName() + " Files number of lines "+ size + " at "+new Date()); 
    String[] declareLines = new String[size]; 
    int startComment = 0; 
    int endComment = 0; 
    Boolean check = false; 
    int j = 0; 
    int i=0; 
    // Reading content line by line 
    for (String line:readLines) { 
     // If line contains */ that means comment is ending in this line, 
     // making a note of the line number 
     if (line.toString().contains("*/")) { 
      endComment = i; 
      // Removing the content before */ from the line 
      int indexOf = line.indexOf("*/"); 
      line = line.replace(line.substring(0, indexOf + 2), ""); 
     } 

     // If startComment is assigned fresh value and end comment hasn't, 
     // that means the current line is part of the comment 
     // Ignoring the line in this case and moving on to the next one 
     if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check) 
      continue; 

     // If line contains /* that means comment is starting in this line, 
     // making a note of the line number 
     if (line.contains("/*")) { 
      startComment = i; 
      // Removing the content after /* from the line 
      int indexOf = line.indexOf("/*"); 
      line = line.replace(line.substring(indexOf), ""); 
      if (i == 0) 
       check = true; // means comment in the very first line 
     } 

     // If line contains -- that means single line comment is present in 
     // this line, 
     // removing the content after -- 
     if (line.contains("--")) { 
      int indexOf = line.indexOf("--"); 
      line = line.replace(line.substring(indexOf), ""); 
     } 
     // If line contains # that means a single line comment is present in 
     // this line, 
     // removing the content after # 
     if (line.contains("#")) { 
      int indexOf = line.indexOf("#"); 
      line = line.replace(line.substring(indexOf), ""); 
     } 

     // At this point, all commented part is removed from the line, hence 
     // appending it to the final content 
     if (!line.isEmpty()) 
      withoutComment = withoutComment + line + " \n"; 
     // If line contains CREATE its a declaration line, holding it 
     // separately in the array 
     if (line.toUpperCase().contains(("CREATE"))) { 
      // If next line does not contains Create and the current line is 
      // the not the last line, 
      // then considering two consecutive lines as declaration line, 
      if (i < size - 1 && !readLines.get(i + 1).toString().toUpperCase().contains(("CREATE"))) { 
       declare = line + " " + readLines.get(i + 1).toString() + "\n"; 
      } else if (i < size) {// If the line is last line, including 
            // that line alone. 
       declare = line + "\n"; 
      } 

      declareLines[j] = declare.toUpperCase(); 
      j++; 
     } 
     i++; 
    } 
    System.out.println("Read lines "+ new Date()); 
    List<String> list = new ArrayList<String>(Arrays.asList(declareLines)); 
    list.removeAll(Collections.singleton(null)); 

    content = list.toArray(new String[list.size() + 1]); 

    withoutComment = withoutComment.toUpperCase(); 
    content[j] = withoutComment; 
    System.out.println("Returning uncommented content "+ new Date()); 
    return content; 
} 


public static void main(String[] args) throws IOException { 
     String[] content = filterContent(new File("abc.txt")); 
} 

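The problem with this code is that it is far too slow when the file is large. For a 10 MB file, removing the comments takes more than 6 hours. (The code runs on an SSH server.)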

I may have files of up to 100 MB, for which removing the comments would take days. How can I remove the comments faster?

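Update: This question is not a duplicate, because my problem is not solved just by changing the way the lines are read. It is the string operations that make the process slow, and I need a way to make the comment removal itself faster.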

+0

1. Don't load the whole file into memory. 2. Why do you want to do this? – Axel

+0

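First, don't put it into a list; read the file with an InputStream and analyze the lines directly. You can easily check whether a line contains `/*` or `/* ... */`, remove it, and write a new file without the comments. Reading a file of more than 100 MB should not take that long... – AxelH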

+0

[How to read a large text file line by line using Java?](http://stackoverflow.com/questions/5868369/how-to-read-a-large-text-file-line-by-java) – AxelH

Answers

0

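I found that the biggest problem in my code was the use of Strings. The method used to read the lines made no big difference, but using a StringBuilder instead of a String to store the uncommented lines changed the performance dramatically. With StringBuilder, the same code now removes the comments in a few seconds where it used to take hours.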

Here is the code. For better performance, I also replaced the List with a BufferedReader.

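import java.io.BufferedReader; 
import java.io.File; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.util.ArrayList; 
import java.util.Collections; 
import java.util.Date; 
import java.util.List; 

/** 
    * Removes all the commented part from the file content as well as returns a 
    * file structure which have just lines with declaration syntax for eg. 
    * Create Package packageName <- Stores all declaration lines as separate 
    * strings in a list 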
    * 
    * @param file 
    * @return file content 
    * @throws IOException 
    */ 
    private static List<String> filterContent(File file) throws IOException { 

     StringBuilder withoutComment = new StringBuilder(); 
//  String declare = ""; 
//  String[] content; 
//  List<String> readLines = FileUtils.readLines(file); 
// 
//  int size = readLines.size(); 
     System.out.println(file.getName() + " at " + new Date()); 
     List<String> declareLines = new ArrayList<String>(); 
     // String line = null; 
     int startComment = 0; 
     int endComment = 0; 
     Boolean check = false; 
     Boolean isLineDeclaration = false; 

     int j = 0; 
     int i = 0; 

     InputStream in = new FileInputStream(file); 
     BufferedReader reader = new BufferedReader(new InputStreamReader(in)); 
     String line; 
     // Reading content line by line 
     while ((line = reader.readLine()) != null) { 
      // for (int i = 0; i < size; i++) { 
      // line = readLines.get(i).toString();// storing current line data 
      // If line contains */ that means comment is ending in this line, 
      // making a note of the line number 
      if (line.contains("*/")) { 
       endComment = i; 
       // A comment that opened on the very first line ends here, so stop 
       // skipping lines from now on 
       check = false; 
       // Removing the content before */ from the line 
       int indexOf = line.indexOf("*/"); 
       line = line.substring(indexOf + 2); 
      } 

      // If startComment is assigned fresh value and end comment hasn't, 
      // that means the current line is part of the comment 
      // Ignoring the line in this case and moving on to the next one 
      if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check) 
       continue; 

      // If line contains /* that means comment is starting in this line, 
      // making a note of the line number 
      if (line.contains("/*")) { 
       startComment = i; 
       // Removing the content after /* from the line 
       int indexOf = line.indexOf("/*"); 
       line = line.replace(line.substring(indexOf), ""); 
       if (i == 0) 
        check = true; // means comment in the very first line 
      } 

      // If line contains -- that means single line comment is present in 
      // this line, 
      // removing the content after -- 
      if (line.contains("--")) { 
       int indexOf = line.indexOf("--"); 
       line = line.replace(line.substring(indexOf), ""); 
      } 
      // If line contains # that means a single line comment is present in 
      // this line, 
      // removing the content after # 
      if (line.contains("#")) { 
       int indexOf = line.indexOf("#"); 
       line = line.replace(line.substring(indexOf), ""); 
      } 

      // At this point, all commented part is removed from the line, hence 
      // appending it to the final content 
      if (!line.isEmpty()) 
       withoutComment.append(line).append(" \n"); 
      // If line contains CREATE its a declaration line, holding it 
      // separately in the array 
      if (line.toUpperCase().contains(("CREATE"))) { 
       // Start a new declaration entry for this CREATE line; the next 
       // line may be appended to it below 
       declareLines.add(line.toUpperCase()); 

       isLineDeclaration = true; 
       j++; 
      } else if (isLineDeclaration && !line.toUpperCase().contains(("CREATE"))) { 
       // The line right after a CREATE line does not contain CREATE 
       // itself, so it is treated as the continuation of the declaration 
       // and appended to the previous entry 
       declareLines.set(j - 1, declareLines.get(j - 1) + " " + line.toUpperCase()); 
       isLineDeclaration = false; 
      } 
      i++; 
     } 

     reader.close(); 
     System.out.println("Read lines " + new Date()); 
//  List<String> list = new ArrayList<String>(Arrays.asList(declareLines)); 
     declareLines.removeAll(Collections.singleton(null)); 

//  content = list.toArray(new String[list.size() + 1]); 

//  withoutComment = withoutComment..toUpperCase(); 
     declareLines.add(withoutComment.toString().toUpperCase()); 
     System.out.println("Returning uncommented content " + new Date()); 
     return declareLines; 
    } 
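
To isolate why that one change matters so much, here is a minimal sketch rather than the code from the answer. `ConcatDemo`, `joinWithString` and `joinWithBuilder` are hypothetical names, and the 20,000 generated lines are an arbitrary test size. Each `out = out + line` builds a new String and copies everything accumulated so far, so the total work grows roughly quadratically with the number of lines, while `StringBuilder.append` writes into a growable buffer.

```java
import java.util.ArrayList;
import java.util.List;

final class ConcatDemo {

    // Quadratic: every iteration allocates a new String and copies
    // all of the text accumulated so far.
    static String joinWithString(List<String> lines) {
        String out = "";
        for (String line : lines) {
            out = out + line + " \n";
        }
        return out;
    }

    // Roughly linear: append() writes into a buffer that grows as needed.
    static String joinWithBuilder(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(line).append(" \n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Arbitrary synthetic input, only for comparing the two approaches.
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            lines.add("select " + i + " from dual;");
        }
        long t0 = System.nanoTime();
        String a = joinWithString(lines);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(lines);
        long t2 = System.nanoTime();
        System.out.println("same result: " + a.equals(b));
        System.out.println("String +      : " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("StringBuilder : " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The absolute timings depend on the machine and JVM, so treat them only as a relative comparison; the gap widens quickly as the number of lines grows.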
0

You could create multiple threads to do the work (this requires splitting the lines correctly).

+0

The file may have as many as 500,000 lines. Wouldn't creating hundreds of threads overload the thread stack? –
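
On the thread-count worry: the number of threads does not have to follow the number of lines. A rough sketch, assuming a fixed-size pool sized to the available processors and work submitted in chunks of lines; `ChunkedStripper`, `stripLineComment` and `chunkSize` are hypothetical names. To keep every line independent, it only removes `--` line comments; block comments that span a chunk boundary are deliberately not handled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkedStripper {

    // Stateless per-line work: drop everything from "--" onward.
    static String stripLineComment(String line) {
        int idx = line.indexOf("--");
        return idx >= 0 ? line.substring(0, idx) : line;
    }

    // Processes the lines in chunks on a small fixed pool; chunkSize must be > 0.
    static List<String> stripInParallel(List<String> lines, int chunkSize) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += chunkSize) {
                int end = Math.min(start + chunkSize, lines.size());
                final List<String> chunk = lines.subList(start, end);
                Callable<List<String>> task = () -> {
                    List<String> out = new ArrayList<>(chunk.size());
                    for (String line : chunk) {
                        out.add(stripLineComment(line));
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            // Collect results in submission order so line order is preserved.
            List<String> result = new ArrayList<>(lines.size());
            for (Future<List<String>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while stripping", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With 500,000 lines and a chunk size of, say, 10,000, this queues 50 tasks but never runs more threads than the machine has processors.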

0

Some ideas to make this code faster:

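Read the file with an InputStream and analyze the lines directly, writing each resulting String to the uncommented file. This avoids going through the content several times (once to create List<String> readLines, once for your iteration).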

Design: you could use a map of the comment syntaxes instead of this redundant code.

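Once that is done, it should be faster. Of course, multithreading could be a solution, but it requires some checks to make sure you do not split the file in the middle of a comment block. So improve the code first, and then you can think about that.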
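
The first two ideas can be sketched roughly as follows. This is an illustration under stated assumptions, not a drop-in replacement: `StreamingCommentStripper`, `strip`, `stripText` and `LINE_MARKERS` are hypothetical names, the comment markers are the ones from the question (`/* */`, `--`, `#`), markers inside string literals are not recognized, and whitespace-only lines are dropped. A single `inBlock` flag replaces the line-number bookkeeping, and the line-comment markers live in one list instead of one copied `if` block per marker.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

final class StreamingCommentStripper {

    // All line-comment markers in one table (assumed set, taken from the question).
    private static final List<String> LINE_MARKERS = Arrays.asList("--", "#");
    private static final String BLOCK_OPEN = "/*";
    private static final String BLOCK_CLOSE = "*/";

    // Reads line by line and writes the uncommented text straight to out.
    static void strip(BufferedReader in, Writer out) throws IOException {
        boolean inBlock = false; // true while inside /* ... */
        String line;
        while ((line = in.readLine()) != null) {
            StringBuilder kept = new StringBuilder();
            int pos = 0;
            while (pos < line.length()) {
                if (inBlock) {
                    int close = line.indexOf(BLOCK_CLOSE, pos);
                    if (close < 0) {
                        pos = line.length(); // whole rest of line is comment
                    } else {
                        inBlock = false;
                        pos = close + BLOCK_CLOSE.length();
                    }
                } else {
                    int open = line.indexOf(BLOCK_OPEN, pos);
                    int marker = firstLineMarker(line, pos);
                    if (marker >= 0 && (open < 0 || marker < open)) {
                        kept.append(line, pos, marker); // line comment wins
                        pos = line.length();
                    } else if (open >= 0) {
                        kept.append(line, pos, open);
                        inBlock = true;
                        pos = open + BLOCK_OPEN.length();
                    } else {
                        kept.append(line, pos, line.length());
                        pos = line.length();
                    }
                }
            }
            if (!kept.toString().trim().isEmpty()) {
                out.write(kept.toString());
                out.write('\n');
            }
        }
    }

    // Index of the earliest line-comment marker at or after from, or -1.
    private static int firstLineMarker(String line, int from) {
        int best = -1;
        for (String marker : LINE_MARKERS) {
            int idx = line.indexOf(marker, from);
            if (idx >= 0 && (best < 0 || idx < best)) {
                best = idx;
            }
        }
        return best;
    }

    // Convenience wrapper for in-memory text, e.g. for quick experiments.
    static String stripText(String text) {
        StringWriter out = new StringWriter();
        try {
            strip(new BufferedReader(new StringReader(text)), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For real files, `strip` could be given a reader from `Files.newBufferedReader(path)` and a writer from `Files.newBufferedWriter(target)`, so that neither the input nor the output is ever held in memory as a whole.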