2011-03-15 56 views
1

我有四個MySQL表之間的相關性搜索文本)
反應(ID,poll_id,option_id,USER_ID)高級MySQL:</p> <p><strong>用戶</strong>(ID,姓名)<br> <strong>投票</strong>(ID文本)<br> <strong>選項</strong>(ID,poll_id:問卷調查答覆

給定一個具體的調查和特定選項,我會LIK e生成一個表格,顯示來自其他民意調查的哪些選項是最相關的。

想這是我們的數據集:

TABLE users: 
+------+-------+ 
| id | name | 
+------+-------+ 
| 1 | Abe | 
| 2 | Bob | 
| 3 | Che | 
| 4 | Den | 
+------+-------+ 

TABLE polls: 
+------+-----------------------+ 
| id | text     | 
+------+-----------------------+ 
| 1 | Do you like apples? | 
| 2 | What is your gender? | 
| 3 | What is your height? | 
| 4 | Do you like polls? | 
+------+-----------------------+ 

TABLE options: 

+------+----------+---------+ 
| id | poll_id | text | 
+------+----------+---------+ 
| 1 | 1  | Yes  | 
| 2 | 1  | No  | 
| 3 | 2  | Male | 
| 4 | 2  | Female | 
| 5 | 3  | Short | 
| 6 | 3  | Tall | 
| 7 | 4  | Yes  | 
| 8 | 4  | No  | 
+------+----------+---------+ 

TABLE responses: 

+------+----------+------------+----------+ 
| id | poll_id | option_id | user_id | 
+------+----------+------------+----------+ 
| 1 | 1  | 1   | 1  | 
| 2 | 1  | 2   | 2  | 
| 3 | 1  | 2   | 3  | 
| 4 | 1  | 2   | 4  | 
| 5 | 2  | 3   | 1  | 
| 6 | 2  | 3   | 2  | 
| 7 | 2  | 3   | 3  | 
| 8 | 2  | 4   | 4  | 
| 9 | 3  | 5   | 1  | 
| 10 | 3  | 6   | 2  | 
| 10 | 3  | 5   | 3  | 
| 10 | 3  | 6   | 4  | 
| 10 | 4  | 7   | 1  | 
| 10 | 4  | 7   | 2  | 
| 10 | 4  | 7   | 3  | 
| 10 | 4  | 7   | 4  | 
+------+----------+------------+----------+ 

鑑於投票ID 1,並且選擇ID 2,生成的表應該是這樣的:

+----------+------------+-----------------------+ 
| poll_id | option_id | percent_correlated | 
+----------+------------+-----------------------+ 
| 4  | 7   | 100     | 
| 2  | 3   | 66.66     | 
| 3  | 6   | 66.66     | 
| 2  | 4   | 33.33     | 
| 3  | 5   | 33.33     | 
| 4  | 8   | 0      | 
+----------+------------+-----------------------+ 

所以基本上,我們」重新標識對輪詢ID 1和選定的選項ID 2作出響應的所有用戶,並且我們正在查看所有其他輪詢以查看他們還選擇了其他哪些百分比選項。

+0

我知道這些表是不完美優化 - 請忽略的是,除非它會顯著影響的解決方案。 – jawns317 2011-03-15 13:10:02

回答

1

還沒有派上用場的情況下進行測試,你可以看到,如果這得到正確的結果:

select 
     poll_id, 
     option_id, 
     ((psum - (sum1 * sum2/n))/sqrt((sum1sq - pow(sum1, 2.0)/n) * (sum2sq - pow(sum2, 2.0)/n))) AS r, 
     n 
from 
(
    select 
     poll_id, 
     option_id, 
     SUM(score) AS sum1, 
     SUM(score_rev) AS sum2, 
     SUM(score * score) AS sum1sq, 
     SUM(score_rev * score_rev) AS sum2sq, 
     SUM(score * score_rev) AS psum, 
     COUNT(*) AS n 
    from 
    (
      select 
       responses.poll_id, 
       responses.option_id, 
       CASE 
        WHEN user_resp.user_id IS NULL THEN SELECT 0 
        ELSE SELECT 1 
       END CASE as score, 
       CASE 
        WHEN user_resp.user_id IS NULL THEN SELECT 1 
        ELSE SELECT 0 
       END CASE as score_rev, 
      from responses left outer join 
        (
         select 
          user_id 
         from 
          responses 
         where 
          poll_id = 1 and 
          option_id = 2 
        )user_resp 
         ON (user_resp.user_id = responses.user_id) 
    ) temp1 
    group by 
     poll_id, 
     option_id 
)components 
1

試錯的幾個小時後,我好不容易纔拼湊的作品正確的查詢:

SELECT poll_id AS p_id, 
     option_id AS o_id, 
     COUNT(*) AS optCount, 

     (SELECT COUNT(*) FROM response WHERE option_id = o_id AND user_id IN 
      (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2'))/
     (SELECT COUNT(*) FROM response WHERE poll_id = p_id AND user_id IN 
      (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) 
     AS percentage 

FROM response 
INNER JOIN 
    (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2') AS user_ids 
ON response.user_id = user_ids.user_id 
WHERE poll_id != '1' 

GROUP BY option_id DESC 
ORDER BY percentage DESC, optCount DESC 

基於一個小數據集的測試中,這個查詢看起來是相當快的,但我想修改它,在「IN」子查詢不重複三次。有什麼建議麼?

1

這似乎給正確的結果對我來說:

select poll_stats.poll_id, 
     option_stats.option_id, 
     (100 * option_responses/poll_responses) as percent_correlated 
from (select response.poll_id, 
      count(*) as poll_responses 
     from response selecting_response 
      join response on response.user_id = selecting_response.user_id 
     where selecting_response.poll_id = 1 and selecting_response.option_id = 2 
     group by response.poll_id) poll_stats 
     join (select options.poll_id, 
        options.id as option_id, 
        count(response.id) as option_responses 
      from options 
       left join response on response.poll_id = options.poll_id 
          and response.option_id = options.id 
          and exists (
          select 1 from response selecting_response 
          where selecting_response.user_id = response.user_id 
            and selecting_response.poll_id = 1 
            and selecting_response.option_id = 2) 
      group by options.poll_id, options.id 
      ) as option_stats 
     on option_stats.poll_id = poll_stats.poll_id 
where poll_stats.poll_id <> 1 
order by 3 desc, option_responses desc 
相關問題