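Speeding up reading multiple files and comparing their contents against multiple strings
2014-11-03 83 views

Problem: I have an array of 700 strings that I read into a List. I then have a directory containing more than 1,500 files. I need to open each of these files and check whether any of the 700 strings appears anywhere in it.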

Current solution: after reading in the 700 strings (which is almost instantaneous), this is what I do:

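import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed declarations: the post uses these two fields without showing them.
private static final Map<String, String> matchLocations = new HashMap<>();
private static int currentCount = 0;

public static void scanMyDirectory(final File myDirectory, final List<String> listOfStrings) {
    final File[] entries = myDirectory.listFiles();
    if (entries == null) {
        return; // not a directory, or it could not be listed
    }
    for (final File fileEntry : entries) {
        System.out.println("Entering file: " + currentCount++);
        if (fileEntry.isDirectory()) {
            scanMyDirectory(fileEntry, listOfStrings);
            continue;
        }
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(fileEntry))) {
            String sCurrentLine;
            while ((sCurrentLine = br.readLine()) != null) {
                // Every line is compared against every search string
                for (final String candidate : listOfStrings) {
                    if (org.apache.commons.lang3.StringUtils.containsIgnoreCase(sCurrentLine, candidate)) {
                        // A String-valued map keeps only the last matching file per string
                        matchLocations.put(candidate, fileEntry.getPath());
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}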

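After calling this routine, all the results are stored in a HashMap, and I can write them to the screen or to a file.

Question: what is a faster way to do this? It seems very slow (roughly 20-25 minutes to get through 1,500 files). I'm not very familiar with threads, but I have considered using them. However, the top answer to this question put me off somewhat. What is the best way to speed this up?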

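Based on the answer you linked, multithreading would not be a good idea here. Are you using NIO, as that answer suggests? – Azar 2014-11-03 11:48:27

No, that is another thing I am considering. I would like to evaluate as many answers as possible before committing to a particular route. – 2014-11-03 11:50:35

The answer you linked is correct. Unless you are reading the files from 15 different SSDs, the disk will be the bottleneck. – Michael 2014-11-03 11:50:58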

Answers

I prefer reading the lines with NIO:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

private final Map<String, String> matchLocations = new HashMap<>();
private int currentCount = 0;

public void scanMyDirectory(final File myDirectory, final List<String> listOfStrings) {
    File[] files = myDirectory.listFiles();
    if (files == null) {
        return;
    }
    Stream.of(files).forEach(fileEntry -> {
        if (fileEntry.isDirectory()) {
            scanMyDirectory(fileEntry, listOfStrings);
        } else {
            System.out.println("Entering file: " + currentCount++);
            try {
                // Read the whole file at once, then build one lowercase buffer to search
                List<String> lines = Files.readAllLines(Paths.get(fileEntry.getAbsolutePath()), StandardCharsets.UTF_8);
                StringBuilder sb = new StringBuilder();
                lines.forEach(s -> sb.append(s.toLowerCase()).append("\n"));
                listOfStrings.forEach(s -> {
                    // indexOf returns 0 for a match at the very start, so test >= 0
                    if (sb.indexOf(s.toLowerCase()) >= 0) {
                        matchLocations.put(s, fileEntry.getPath());
                    }
                });
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    });
}
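The listing above lowercases each of the 700 search strings again for every file. A minimal sketch of doing that work once up front is shown below; the class name `NeedleMatcher` and its method `findIn` are hypothetical names chosen for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: lowercases the search strings once, then reuses them for every file.
final class NeedleMatcher {
    private final List<String> originals;
    private final List<String> lowered;

    NeedleMatcher(List<String> needles) {
        this.originals = new ArrayList<>(needles);
        this.lowered = new ArrayList<>(needles.size());
        for (String needle : needles) {
            lowered.add(needle.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the original search strings that occur in content, ignoring case.
    List<String> findIn(String content) {
        String haystack = content.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lowered.size(); i++) {
            // contains() also reports a match at index 0
            if (haystack.contains(lowered.get(i))) {
                hits.add(originals.get(i));
            }
        }
        return hits;
    }
}
```

One matcher would be built before the directory walk and passed the text of each file in turn. Whether this makes a measurable difference depends on how much of the run time is spent on disk reads rather than on string handling.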

As mentioned above, multithreading is not necessary here... but if you are interested:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Stream;

private final ConcurrentHashMap<String, String> matchLocations = new ConcurrentHashMap<>();
private final ForkJoinPool pool = new ForkJoinPool();
private int currentCount = 0;

public void scanMyDirectory(final File myDirectory, final List<String> listOfStrings) {
    File[] files = myDirectory.listFiles();
    if (files == null) {
        return;
    }
    Stream.of(files).forEach(fileEntry -> {
        if (fileEntry.isDirectory()) {
            scanMyDirectory(fileEntry, listOfStrings);
        } else {
            System.out.println("Entering file: " + currentCount++);
            // Each file is searched on a pool thread; the walk itself stays single-threaded
            pool.submit(new Reader(listOfStrings, fileEntry));
        }
    });
}

class Reader implements Runnable {

    final List<String> listOfStrings;
    final File file;

    Reader(List<String> listOfStrings, File file) {
        this.listOfStrings = listOfStrings;
        this.file = file;
    }

    @Override
    public void run() {
        try {
            List<String> lines = Files.readAllLines(Paths.get(file.getAbsolutePath()), StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            lines.forEach(s -> sb.append(s.toLowerCase()).append("\n"));
            listOfStrings.forEach(s -> {
                // indexOf returns 0 for a match at the very start, so test >= 0
                if (sb.indexOf(s.toLowerCase()) >= 0) {
                    matchLocations.put(s, file.getPath());
                }
            });
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
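The listing above submits one task per file but does not show how the caller waits for them, and the result map is only complete once every task has finished. The following is a rough sketch of one way to block until all tasks are done, using `ExecutorService.invokeAll`. It swaps in a fixed thread pool instead of the ForkJoinPool from the answer, searches in-memory strings rather than files to stay self-contained, and the names `ParallelScan` and `containsAll` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: one task per text, with the caller blocked until all are done.
final class ParallelScan {
    static List<Boolean> containsAll(List<String> texts, String needle, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String text : texts) {
                tasks.add(() -> text.contains(needle));
            }
            List<Boolean> results = new ArrayList<>();
            // invokeAll returns only after every task has completed,
            // and the futures come back in the same order as the tasks
            for (Future<Boolean> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a scan task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With the pool from the answer, a comparable effect could be had by keeping the futures returned by `submit` and calling `get()` on each before reading `matchLocations`.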

Edit

Bug fix:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Stream;

private final ConcurrentHashMap<String, List<String>> matchLocations = new ConcurrentHashMap<>();
private final ForkJoinPool pool = new ForkJoinPool();
private int currentCount = 0;

public void scanMyDirectory(final File myDirectory, final List<String> listOfStrings) {
    File[] files = myDirectory.listFiles();
    if (files == null) {
        return;
    }
    Stream.of(files).forEach(fileEntry -> {
        if (fileEntry.isDirectory()) {
            scanMyDirectory(fileEntry, listOfStrings);
        } else {
            System.out.println("Entering file: " + currentCount++);
            Reader reader = new Reader(listOfStrings, fileEntry);
            pool.submit(reader);
        }
    });
}

class Reader implements Runnable {

    final List<String> listOfStrings;
    final File file;

    Reader(List<String> listOfStrings, File file) {
        this.listOfStrings = listOfStrings;
        this.file = file;
    }

    @Override
    public void run() {
        try (FileInputStream fileInputStream = new FileInputStream(file);
             FileChannel channel = fileInputStream.getChannel()) {
            StringBuilder sb = new StringBuilder();
            ByteBuffer buffer = ByteBuffer.allocate(512);
            while (channel.read(buffer) != -1) {
                buffer.flip();
                // Decode only the bytes just read, not stale bytes left in the array.
                // Assumption: ISO-8859-1 (one byte per char) is used so a chunk boundary
                // cannot split a character; this suits ASCII search strings.
                sb.append(new String(buffer.array(), 0, buffer.limit(), StandardCharsets.ISO_8859_1).toLowerCase());
                buffer.clear();
            }
            listOfStrings.stream()
                    .map(String::toLowerCase)
                    .forEach(s -> {
                        // indexOf returns 0 for a match at the very start, so test >= 0
                        if (sb.indexOf(s) >= 0) {
                            // computeIfAbsent is atomic; the list must be thread-safe as well
                            matchLocations.computeIfAbsent(s, key -> new CopyOnWriteArrayList<>())
                                    .add(file.getAbsolutePath());
                        }
                    });
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
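The MalformedInputException reported in the comments is what `Files.readAllLines` throws when a file contains bytes that are not valid in the requested charset (UTF-8 in the earlier listings). A small sketch of a more forgiving read follows: it loads the raw bytes and decodes them as ISO-8859-1, which accepts every byte value. The class and method names are hypothetical, and the choice of ISO-8859-1 is an assumption that only suits search strings made of ASCII characters.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper: reads a file without failing on bytes that are invalid UTF-8.
final class LenientReader {
    // Decodes raw bytes one byte per char and lowercases the result for searching.
    static String decodeLowercased(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1).toLowerCase(Locale.ROOT);
    }

    // Reads the whole file into memory; fine for small files, not for very large ones.
    static String readLowercased(Path file) {
        try {
            return decodeLowercased(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the search strings are lowercased too, any results map keyed by them will hold the lowercase forms, which matches what the asker observed in the comments.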
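I like this solution, but I have run into a problem. On some valid files it skips them and throws a java.nio.charset.MalformedInputException: Input length = 1 error. When I open them, they look fine to me. What could the problem be? – 2014-11-03 13:01:50

@AndrewMartin It is probably due to the encoding of the files... I have edited my post – FaNaJ 2014-11-03 13:15:54

I think so. I tried it with lowercase and changed my list of strings to lowercase, and it matched my original results. Thanks a lot. – 2014-11-03 13:24:47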
