2017-04-22 80 views
0

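I have a query that does what I want on a truncated dataset, but when I run it on the full dataset (millions of rows) it takes forever. Is there a way to optimize this MySQL query (an UPDATE with multiple joins)?

I have two tables: microsat_table and coverage_table.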

microsat_table:

+----+----------+-----------+---------+-------------------------------------------------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence                                        |
+----+----------+-----------+---------+-------------------------------------------------+
|  2 | chr2L    |     11050 |   11067 | TTTAATTTAATTTAATTT                              |
|  3 | chr2L    |     44173 |   44187 | TATGTATGTATGTAT                                 |
|  5 | chr2L    |     54431 |   54477 | ATAATAATATAATATAATATAATATAATATATAATAATATAATAATA |
|  6 | chr2L    |     57571 |   57594 | ATATATATATATATATATATATAT                        |
|  7 | chr2L    |     72439 |   72453 | CATACATACATACAT                                 |
|  8 | chr2L    |     74028 |   74042 | ATACATACATACATA                                 |
|  9 | chr2L    |     85573 |   85587 | ATTTTATTTTATTTT                                 |
| 10 | chr2L    |     92429 |   92443 | ACATACATACATACA                                 |
| 11 | chr2L    |    138132 |  138166 | TATATAGATATATAAATATATATATATATATATAT             |
| 13 | chr2L    |    162245 |  162259 | ATACATACATACATA                                 |
+----+----------+-----------+---------+-------------------------------------------------+

coverage_table:

+----------+-------+-------+----------+
| Seq_Name | Start | Stop  | Coverage |
+----------+-------+-------+----------+
| chr2L    |  5716 |  5771 |        1 |
| chr2L    |  8730 |  8824 |        1 |
| chr2L    |  9894 |  9948 |        1 |
| chr2L    | 19391 | 19491 |        1 |
| chr2L    | 19575 | 19675 |        1 |
| chr2L    | 19773 | 19776 |        1 |
| chr2L    | 19776 | 19872 |        2 |
| chr2L    | 21920 | 21959 |        1 |
| chr2L    | 21959 | 22020 |        2 |
| chr2L    | 22020 | 22059 |        1 |
+----------+-------+-------+----------+

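I want to add a column to microsat_table that holds the average coverage (from coverage_table) over all coverage_table rows whose Start and Stop values fall within the SSR_Start and SSR_End values of the microsat_table row.

Example result: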

+-----+----------+-----------+---------+--------------------------------+---------+
| id  | Seq_Name | SSR_Start | SSR_End | Sequence                       | avg     |
+-----+----------+-----------+---------+--------------------------------+---------+
|  53 | chr2L    |    402489 |  402503 | AAAACAAAACAAAAC                |  3.0000 |
|  64 | chr2L    |    447214 |  447233 | CAGCAGCAGCAGCAGCAGCA           |  8.0000 |
|  66 | chr2L    |    457839 |  457868 | CAGCAGCAGCAACAGCAGCAGCAGGCAGCA |  2.0000 |
| 105 | chr2L    |    579589 |  579603 | TCGAATCGAATCGAA                | 11.0000 |
| 123 | chr2L    |    628484 |  628501 | TAATGTTAATGTTAATGT             |  6.0000 |
+-----+----------+-----------+---------+--------------------------------+---------+

My query is:

UPDATE microsat_table 
JOIN 
    (SELECT m.id, SUM(p.Coverage)/count(p.Start) 
     AS avg FROM microsat_table m 
     LEFT OUTER JOIN coverage_table p 
     ON m.Seq_Name LIKE p.Seq_Name 
     WHERE m.Seq_Name LIKE p.Seq_Name GROUP BY m.id) AS qt 
ON microsat_table.id = qt.id 
SET microsat_table.avg = qt.avg; 

EXPLAIN output for the truncated tables:

+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
| id | select_type | table                | partitions | type  | possible_keys                                     | key         | key_len | ref                            | rows   | filtered | Extra                                              |
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
|  1 | UPDATE      | microsat_table_short | NULL       | ALL   | PRIMARY                                           | NULL        | NULL    | NULL                           |  40356 |   100.00 | NULL                                               |
|  1 | PRIMARY     | <derived2>           | NULL       | ref   | <auto_key0>                                       | <auto_key0> | 4       | testdb.microsat_table_short.id |   1236 |   100.00 | NULL                                               |
|  2 | DERIVED     | m                    | NULL       | index | PRIMARY,Sequence,Seq_Name,Motif,SSR_Start,SSR_End | Seq_Name    | 53      | NULL                           |  40356 |   100.00 | Using index; Using temporary; Using filesort       |
|  2 | DERIVED     | p                    | NULL       | ALL   | NULL                                              | NULL        | NULL    | NULL                           | 100163 |     1.23 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+

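I added indexes (I tried both HASH and BTREE), which sped things up considerably, but I have let the query run on the large dataset for 1.5 days and it still hasn't finished.

Does anyone have suggestions for making it run faster?

Thanks!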

+0

Please also add the query plan for the actual tables, since that is the one that is slow. –
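
For reference, a sketch of how that plan could be captured: prefixing the statement with EXPLAIN shows the plan without executing the update. This assumes a MySQL version that accepts EXPLAIN on UPDATE statements (the UPDATE row in the plan posted in the question suggests it does) and that the full-size tables are named microsat_table and coverage_table as in the question.

```sql
-- The question's statement, unchanged, with EXPLAIN in front.
-- Nothing is modified; only the execution plan is returned.
EXPLAIN
UPDATE microsat_table
JOIN
    (SELECT m.id, SUM(p.Coverage)/count(p.Start) AS avg
     FROM microsat_table m
     LEFT OUTER JOIN coverage_table p
     ON m.Seq_Name LIKE p.Seq_Name
     WHERE m.Seq_Name LIKE p.Seq_Name
     GROUP BY m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;
```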

+0

The result set does not correspond to the dataset. See https://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query – Strawberry

Answers

1

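There are a few relatively minor infelicities in your code. The biggest problem, however, is that although you say you want the average coverage over all rows whose Start and Stop values in coverage_table fall within the SSR_Start and SSR_End values in microsat_table, your query doesn't actually seem to restrict itself to that. Instead, you have only coded a match on Seq_Name.

The code below tries to fix that (I used >= and <=, which may not be exactly what you need) along with the other, smaller bits: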

UPDATE microsat_table
JOIN
    (
    SELECT
     m.id,
     AVG(p.Coverage) AS avg -- MySQL has its own average function
    FROM
     microsat_table m
     INNER JOIN coverage_table p ON -- Changed to INNER JOIN; your old WHERE clause had this effect anyway
      m.Seq_Name = p.Seq_Name -- Use '=' rather than LIKE when looking for an exact match
    WHERE
     p.Start >= m.SSR_Start -- This WHERE clause is the most important change
     AND p.Stop <= m.SSR_End -- You omitted it in your version (the column is named Stop in coverage_table)
    GROUP BY
     m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;
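
On the >= / <= caveat: the WHERE clause above only counts coverage intervals that lie entirely inside the repeat. If intervals that merely overlap the repeat should count as well, the condition would look different. Below is a hedged, read-only sketch of that variant, assuming the end column in coverage_table is named Stop as in the table listing in the question; the alias avg_cov and the LIMIT are only there for illustration. Whether intervals that touch at an endpoint should count depends on the coordinate convention of the data, which the question does not state.

```sql
-- Preview only: this SELECT does not update anything.
SELECT
    m.id,
    AVG(p.Coverage) AS avg_cov      -- illustrative alias
FROM microsat_table m
INNER JOIN coverage_table p
    ON m.Seq_Name = p.Seq_Name
WHERE p.Start <= m.SSR_End          -- coverage interval starts before the repeat ends
  AND p.Stop  >= m.SSR_Start        -- and ends after the repeat starts
GROUP BY m.id
LIMIT 20;                           -- arbitrary cap, just to eyeball a few rows
```

Either way, because of the INNER JOIN, rows of microsat_table with no matching coverage interval do not appear in the result, so an UPDATE built on it leaves their avg value untouched.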
+0

Until this suggestion has been addressed (and verified), the original question about indexes cannot be answered. –

+0

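Thanks! It cut the query from 40 seconds to 8 seconds on the command line! For some reason, when I run the exact same query from a script, it stretches to 900 seconds :-/. I'm trying to figure out what is going on there, but it seems it must be the script rather than the query. –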

0

Do you really need to use 'LIKE'? It is one of the worst performers.

+0

And don't do the same thing twice. –

+0

Thanks! Removing 'LIKE' seems to have helped! –

0

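Maybe updating the table in one big transaction is simply too much for the system? (How big is the table you are updating?) You could try doing it in blocks. I would also go for a simple subselect, which is easier to read IMHO.

Also note Steve Lovell's remark that your query doesn't seem to care about the Start/Stop columns. Since you most likely forgot that by accident, I added it here too; removing it shouldn't be too difficult =)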

DECLARE @min_id int,
     @max_id int,
     @blocksize int

SELECT @min_id = MIN(id),
     @max_id = MAX(id),
     @blocksize = 100000 -- adapt as needed
    FROM microsat_table

WHILE @min_id <= @max_id
    BEGIN

     UPDATE microsat_table
      SET microsat_table.avg = (SELECT SUM(p.Coverage)/count(p.Start) AS avg
             FROM microsat_table m
             LEFT OUTER JOIN coverage_table p
                ON m.Seq_Name LIKE p.Seq_Name -- if possible use '=' here instead of LIKE
                AND p.Start >= m.SSR_Start -- flagrantly "stolen" from Steve Lovell's answer
                AND p.Stop <= m.SSR_End -- the column is named Stop in coverage_table
             WHERE m.id = microsat_table.id)
     -- limit update to this block:
     WHERE microsat_table.id BETWEEN @min_id AND (@min_id + @blocksize - 1)

     -- prepare for next block
     SELECT @min_id = @min_id + @blocksize
    END

You will probably want an index on the id field of microsat_table and a key on the Seq_name + Start columns of coverage_table.
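
A minimal sketch of that in MySQL syntax. The index name is made up for illustration, and whether this index is the best choice for the full dataset is something only EXPLAIN on the real tables can confirm.

```sql
-- Hypothetical index name; Seq_Name + Start is the column combination suggested above.
ALTER TABLE coverage_table
    ADD INDEX idx_cov_seq_start (Seq_Name, Start);

-- microsat_table.id is listed as PRIMARY in the EXPLAIN output in the question,
-- so it should already be indexed. SHOW INDEX lists what currently exists:
SHOW INDEX FROM microsat_table;
SHOW INDEX FROM coverage_table;
```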

+0

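Thanks! Yes, I did indeed forget it :-). A question: I have never used MySQL stored procedures and variables before. Do I need to put all of this code inside a function, or can I type it in at the command line as is? I get a syntax error when I type it in, and when I tried putting it into a function that also gave me a syntax error. I could well be getting the function declaration wrong, since I have never worked with them before... –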

+0

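I admit I'm used to MSSQL. It seems I naively but wrongly assumed this was pretty much ANSI SQL and wouldn't be much trouble on MySQL either. Trying to get it to run on sqlfiddle showed me I was wrong... I'll try to get it working on MySQL and then adjust my code... (this may take a while =) – deroby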
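
For anyone landing here with the same syntax errors: in MySQL, DECLARE and WHILE are only valid inside a stored program, so the block loop has to live in a stored procedure. Below is a hedged sketch of how the same idea might look there, not a tested port of the code above. The procedure name, parameter name, and variable names are invented for illustration; it assumes the end column in coverage_table is named Stop, uses '=' instead of LIKE, and uses an inner join, so rows without matching coverage keep their current avg value.

```sql
-- DELIMITER is a command of the mysql command-line client, not SQL itself;
-- it lets the procedure body contain ';' without ending the CREATE statement.
DELIMITER //

CREATE PROCEDURE update_avg_in_blocks(IN p_blocksize INT)  -- hypothetical name
BEGIN
    DECLARE v_min_id INT;
    DECLARE v_max_id INT;

    SELECT MIN(id), MAX(id) INTO v_min_id, v_max_id
    FROM microsat_table;

    -- If the table is empty both variables are NULL and the loop is skipped.
    WHILE v_min_id <= v_max_id DO
        UPDATE microsat_table m
        JOIN (
            SELECT m2.id, AVG(p.Coverage) AS avg_cov
            FROM microsat_table m2
            JOIN coverage_table p
              ON p.Seq_Name = m2.Seq_Name
             AND p.Start >= m2.SSR_Start
             AND p.Stop  <= m2.SSR_End
            -- limit the work to the current block of ids
            WHERE m2.id BETWEEN v_min_id AND v_min_id + p_blocksize - 1
            GROUP BY m2.id
        ) AS qt ON qt.id = m.id
        SET m.avg = qt.avg_cov;

        -- prepare for the next block
        SET v_min_id = v_min_id + p_blocksize;
    END WHILE;
END //

DELIMITER ;

CALL update_avg_in_blocks(100000);  -- block size: adapt as needed
```

With autocommit enabled, each UPDATE in the loop should commit on its own, which is the point of working in blocks rather than in one large transaction.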
