2011-04-09 96 views
2

I'm trying to import all of the googlebooks-1gram files into a PostgreSQL database, and I wrote the Java code below to do it. How can I make this code faster?

public class ToPostgres { 

    public static void main(String[] args) throws Exception { 
     String filePath = "./"; 
     List<String> files = new ArrayList<String>(); 
     for (int i =0; i < 10; i++) { 
      files.add(filePath+"googlebooks-eng-all-1gram-20090715-"+i+".csv"); 
     } 
     Connection c = null; 
     try { 
      c = DriverManager.getConnection("jdbc:postgresql://localhost/googlebooks", 
        "postgres", "xxxxxx"); 
     } catch (SQLException e) { 
      e.printStackTrace(); 
     } 

     if (c != null) { 
      try { 
       PreparedStatement wordInsert = c.prepareStatement(
        "INSERT INTO words (word) VALUES (?)", Statement.RETURN_GENERATED_KEYS 
       ); 
       PreparedStatement countInsert = c.prepareStatement(
        "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) " + 
        "VALUES (?,?,?,?,?)" 
       ); 
       String lastWord = ""; 
       Long lastId = -1L; 
       for (String filename: files) { 
        BufferedReader input = new BufferedReader(new FileReader(new File(filename))); 
        String line = ""; 
        while ((line = input.readLine()) != null) { 
         String[] data = line.split("\t"); 
         Long id = -1L; 
         if (lastWord.equals(data[0])) { 
          id = lastId; 
         } else { 
          wordInsert.setString(1, data[0]); 
          wordInsert.executeUpdate(); 
          ResultSet resultSet = wordInsert.getGeneratedKeys(); 
          if (resultSet != null && resultSet.next()) 
          { 
           id = resultSet.getLong(1); 
          } 
         } 
         countInsert.setLong(1, id); 
         countInsert.setInt(2, Integer.parseInt(data[1])); 
         countInsert.setInt(3, Integer.parseInt(data[2])); 
         countInsert.setInt(4, Integer.parseInt(data[3])); 
         countInsert.setInt(5, Integer.parseInt(data[4])); 
         countInsert.executeUpdate(); 
         lastWord = data[0]; 
         lastId = id; 
        } 
       } 
      } catch (SQLException e) { 
       e.printStackTrace(); 
      } 
     } 
    } 

} 

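However, after running for about 3 hours it has only put 1,000,000 entries into the wordcounts table. When I checked the number of lines in the entire 1gram dataset, it came to 500,000,000. At this rate, importing everything would take about 62.5 days. I could accept an import that takes about a week, but 2 months? I think I'm doing something seriously wrong here. (I have a server that runs 24/7, so I really could let it run that long, but faster would be nice XD)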
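For reference, the 62.5-day figure is just the measured rate extrapolated to the full dataset. A small sketch of that arithmetic (the class and method names are made up for illustration):

```java
class ImportEta {
    /** Days needed to import totalRows, given that rowsDone rows took hoursElapsed hours. */
    static double estimateDays(long totalRows, long rowsDone, double hoursElapsed) {
        // (total / done) is how many times longer the full import is than the sample run.
        double totalHours = (double) totalRows / rowsDone * hoursElapsed;
        return totalHours / 24.0;
    }

    public static void main(String[] args) {
        // 1,000,000 rows after 3 hours, 500,000,000 rows in total
        System.out.println(estimateDays(500_000_000L, 1_000_000L, 3.0) + " days");
    }
}
```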

EDIT: This is the code I solved it with:

public class ToPostgres { 

    public static void main(String[] args) throws Exception { 
     String filePath = "./"; 
     List<String> files = new ArrayList<String>(); 
     for (int i =0; i < 10; i++) { 
      files.add(filePath+"googlebooks-eng-all-1gram-20090715-"+i+".csv"); 
     } 
     Connection c = null; 
     try { 
      c = DriverManager.getConnection("jdbc:postgresql://localhost/googlebooks", 
        "postgres", "xxxxxx"); 
     } catch (SQLException e) { 
      e.printStackTrace(); 
     } 

     if (c != null) { 
      c.setAutoCommit(false); 
      try { 
       PreparedStatement wordInsert = c.prepareStatement(
        "INSERT INTO words (id, word) VALUES (?,?)" 
       ); 
       PreparedStatement countInsert = c.prepareStatement(
        "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) " + 
        "VALUES (?,?,?,?,?)" 
       ); 
       String lastWord = ""; 
       Long id = 0L; 
       for (String filename: files) { 
        BufferedReader input = new BufferedReader(new FileReader(new File(filename))); 
        String line = ""; 
        int i = 0; 
        while ((line = input.readLine()) != null) { 
         String[] data = line.split("\t"); 
         if (!lastWord.equals(data[0])) { 
          id++; 
          wordInsert.setLong(1, id); 
          wordInsert.setString(2, data[0]); 
          wordInsert.executeUpdate(); 
         } 
         countInsert.setLong(1, id); 
         countInsert.setInt(2, Integer.parseInt(data[1])); 
         countInsert.setInt(3, Integer.parseInt(data[2])); 
         countInsert.setInt(4, Integer.parseInt(data[3])); 
         countInsert.setInt(5, Integer.parseInt(data[4])); 
         countInsert.executeUpdate(); 
         lastWord = data[0]; 
         if (i % 10000 == 0) { 
          c.commit(); 
         } 
         if (i % 100000 == 0) { 
          System.out.println(i+" mark file "+filename); 
         } 
         i++; 
        } 
        c.commit(); 
       } 
      } catch (SQLException e) { 
       e.printStackTrace(); 
      } 
     } 
    } 

} 

I now reach 1.5 million rows in about 15 minutes. That's fast enough for me, thanks everyone!

+0

Try running it without executing the SQL; you may find that tuning your PostgreSQL database matters more than anything else. – 2011-04-09 15:31:52

Answers

4

By default, JDBC connections have auto-commit enabled, which adds overhead to every statement. Try disabling it:

c.setAutoCommit(false);

Then commit in batches, something along the lines of:

long ops = 0; 

for(String filename : files) { 
    // ... 
    while ((line = input.readLine()) != null) { 
     // insert some stuff... 

     ops ++; 

     if(ops % 1000 == 0) { 
      c.commit(); 
     } 
    } 
} 

c.commit(); 
+0

I'll try that, combined with all the other improvements suggested here. This will surely help :) – teuneboon 2011-04-09 15:36:46

2

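Write it to use threads, running 4 threads at once, or split it into sections (read from a config file), distribute those to X machines, and have them load the data together.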

+0

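Hmm, I guess threading could help. Unfortunately I only have 1 machine to run it on, and it has only 2 cores, so the best case would probably still be 31.25 days – teuneboon 2011-04-09 15:31:16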

+1

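I'm not sure that math is right: the cores sit idle while the file is being read or the database is being updated, so 5, 6 or 7 threads may still show an improvement. – MeBigFatGuy 2011-04-09 15:59:53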

0

Use batch statements to execute multiple inserts at once, rather than one INSERT at a time.
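A minimal sketch of what that could look like for the `countInsert` statement from the question, using the standard JDBC `addBatch()` / `executeBatch()` calls. The class name, the method and the `long[]` row layout are assumptions made for illustration, and the batch size would need tuning:

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchedCounts {
    /**
     * Queues one wordcounts row per element and sends them in groups of batchSize.
     * Each row is {word_id, year, total_count, total_pages, total_books} (assumed layout).
     * Returns how many times executeBatch() was called.
     */
    static int insertBatched(PreparedStatement countInsert, List<long[]> rows, int batchSize)
            throws SQLException {
        int flushes = 0;
        int pending = 0;
        for (long[] row : rows) {
            countInsert.setLong(1, row[0]);
            countInsert.setInt(2, (int) row[1]);
            countInsert.setInt(3, (int) row[2]);
            countInsert.setInt(4, (int) row[3]);
            countInsert.setInt(5, (int) row[4]);
            countInsert.addBatch();            // queue the row instead of executing it
            if (++pending == batchSize) {
                countInsert.executeBatch();    // send the whole group at once
                flushes++;
                pending = 0;
            }
        }
        if (pending > 0) {                     // send whatever is left over
            countInsert.executeBatch();
            flushes++;
        }
        return flushes;
    }
}
```

This combines naturally with disabling auto-commit: commit after each `executeBatch()` (or every few of them) rather than after every row.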

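Also, I would remove the part of your algorithm that updates the word counts after each insert into the words table, and instead calculate all of the word counts once inserting the words is complete.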

3

If your tables have indexes, it might be faster to drop them, insert the data, and recreate the indexes afterwards.
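A rough sketch of that drop / load / recreate sequence over JDBC. The question does not show the schema, so the index name `wordcounts_word_id_idx` and its column are purely hypothetical, as are the class and method names:

```java
import java.sql.SQLException;
import java.sql.Statement;

class IndexToggle {
    // Hypothetical index: adjust to whatever indexes the real tables have.
    static final String DROP_SQL = "DROP INDEX IF EXISTS wordcounts_word_id_idx";
    static final String CREATE_SQL = "CREATE INDEX wordcounts_word_id_idx ON wordcounts (word_id)";

    /** Drops the index, runs the bulk load, then rebuilds the index in one pass. */
    static void loadWithoutIndex(Statement st, Runnable bulkLoad) throws SQLException {
        st.execute(DROP_SQL);
        bulkLoad.run();          // the insert loop goes here
        st.execute(CREATE_SQL);
    }
}
```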

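Turning auto-commit off and committing manually every 10,000 records (check the documentation for a reasonable value, there is some limit) could also speed things up.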

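Generating the index/foreign key yourself and keeping track of it should be faster than wordInsert.getGeneratedKeys();, but I'm not sure from your code whether that is possible.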
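A minimal sketch of generating the keys locally. It assumes, as the loop in the question does, that all lines for the same word are adjacent in the input; the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

class LocalWordIds {
    /**
     * Returns the id to use for each input line, counting up from 1.
     * A new id is handed out whenever the word differs from the previous line's word.
     */
    static List<Long> assignIds(List<String> wordsInFileOrder) {
        List<Long> ids = new ArrayList<>();
        String lastWord = null;
        long id = 0L;
        for (String word : wordsInFileOrder) {
            if (!word.equals(lastWord)) {
                id++;                 // first line of a new word: allocate the next key
            }
            ids.add(id);
            lastWord = word;
        }
        return ids;
    }
}
```

The same counter can then be written into both `words.id` and `wordcounts.word_id`, so no round trip to the database is needed to learn the key.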

There is also a method called "bulk insert". I don't remember the details, but it's a starting point for a search.

+0

Good tip about the keys, I can easily generate those myself – teuneboon 2011-04-09 15:43:34

+0

Yes, and you can create the auto-increment afterwards with a given seed. – 2011-04-09 15:48:45

+1

Bulk inserts from a file in postgres are done with the [`COPY` command](http://www.postgresql.org/docs/8.3/interactive/sql-copy.html). – 2011-04-09 16:03:21
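Since the 1gram files mix words and counts on each line, they would first have to be split into one data stream per table before `COPY` can load them. A sketch of that preprocessing, assuming locally generated ids and the column order used in the question; the class and method names are illustrative only:

```java
import java.util.List;

class CopyRows {
    /** Escapes a value for COPY's default text format (backslash, tab, newline, carriage return). */
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    /**
     * Splits 1gram lines (word, year, total_count, total_pages, total_books, tab-separated)
     * into tab-separated rows for words(id, word) and wordcounts(word_id, year, ...).
     */
    static void convert(List<String> lines, StringBuilder wordsOut, StringBuilder countsOut) {
        String lastWord = null;
        long id = 0L;
        for (String line : lines) {
            String[] data = line.split("\t");
            if (!data[0].equals(lastWord)) {
                id++;  // new word: emit one row for the words table
                wordsOut.append(id).append('\t').append(escape(data[0])).append('\n');
                lastWord = data[0];
            }
            countsOut.append(id).append('\t').append(data[1]).append('\t').append(data[2])
                     .append('\t').append(data[3]).append('\t').append(data[4]).append('\n');
        }
    }
}
```

The two outputs could then be loaded with statements such as `COPY words (id, word) FROM STDIN` and `COPY wordcounts (word_id, "year", total_count, total_pages, total_books) FROM STDIN`, for example through psql or through the PostgreSQL JDBC driver's copy API (`org.postgresql.copy.CopyManager`). For the full dataset you would stream to files rather than build strings in memory.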

0

Create threads:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class ToPostgres {
    public static void main(String[] args) throws Exception {
        String filePath = "./";
        List<String> files = new ArrayList<String>();
        for (int i = 0; i < 10; i++) {
            files.add(filePath + "googlebooks-eng-all-1gram-20090715-" + i + ".csv");
        }
        // Start one worker thread per input file.
        for (String filename : files) {
            new MyThread(filename).start();
        }
    }
}

class MyThread extends Thread {
    private final String file;

    public MyThread(String file) {
        this.file = file;
    }

    @Override
    public void run() {
        // Each thread keeps its own connection, statements and last-word state.
        // Sharing one PreparedStatement would let threads overwrite each other's parameters.
        String lastWord = "";
        long lastId = -1L;
        try (Connection c = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/googlebooks", "postgres", "xxxxxx");
             PreparedStatement wordInsert = c.prepareStatement(
                 "INSERT INTO words (word) VALUES (?)", Statement.RETURN_GENERATED_KEYS);
             PreparedStatement countInsert = c.prepareStatement(
                 "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) " +
                 "VALUES (?,?,?,?,?)");
             BufferedReader input = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = input.readLine()) != null) {
                String[] data = line.split("\t");
                long id = -1L;
                if (lastWord.equals(data[0])) {
                    // Same word as the previous line: reuse its id.
                    id = lastId;
                } else {
                    wordInsert.setString(1, data[0]);
                    wordInsert.executeUpdate();
                    ResultSet resultSet = wordInsert.getGeneratedKeys();
                    if (resultSet != null && resultSet.next()) {
                        id = resultSet.getLong(1);
                    }
                }
                countInsert.setLong(1, id);
                countInsert.setInt(2, Integer.parseInt(data[1]));
                countInsert.setInt(3, Integer.parseInt(data[2]));
                countInsert.setInt(4, Integer.parseInt(data[3]));
                countInsert.setInt(5, Integer.parseInt(data[4]));
                countInsert.executeUpdate();
                lastWord = data[0];
                lastId = id;
            }
        } catch (NumberFormatException | IOException | SQLException e) {
            e.printStackTrace();
        }
    }