T-SQL中的隨機加權選擇

15

丹恩的答案包括一個引入平方律的自我連接。 (n*n/2)行之後的表中有n行。

更理想的是能夠解析表格一次。

DECLARE @id int, @weight_sum int, @weight_point int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = FLOOR(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT 
    @id = CASE WHEN @weight_point < 0 THEN @id ELSE [table].id END, 
    @weight_point = @weight_point - [table].weight 
FROM 
    @table [table] 
ORDER BY 
    [table].Weight DESC

這將通過表格，設置@id每個記錄的id值，而在同一時間遞減@weight點。最終，@weight_point將消極。這意味着前面所有權重的SUM大於隨機選擇的目標值。這是我們想要的記錄，因此從那時起，我們將@id設置爲自己（忽略表中的任何ID）。

這隻會貫穿表格一次，但即使所選值是第一條記錄，也必須貫穿整個表格。因爲平均倉位是通過表的一半（和更少，如果下令升重量）編寫循環也可能會被更快......（特別是如果該權重是在公共組）：

DECLARE @id int, @weight_sum int, @weight_point int, @next_weight int, @row_count int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT @next_weight = MAX(weight) FROM @table 
SELECT @row_count = COUNT(*) FROM @table 
SET @weight_point = @weight_point - (@next_weight * @row_count) 

WHILE (@weight_point > 0) 
BEGIN 
    SELECT @next_weight = MAX(weight) FROM @table WHERE weight < @next_weight 
    SELECT @row_count = COUNT(*) FROM @table WHERE weight = @next_weight 
    SET @weight_point = @weight_point - (@next_weight * @row_count) 
END 

-- # Once the @weight_point is less than 0, we know that the randomly chosen record 
-- # is in the group of records WHERE [table].weight = @next_weight 

SELECT @row_count = FLOOR(((@row_count - 1) * RAND() + 1), 0) 

SELECT 
    @id = CASE WHEN @row_count < 0 THEN @id ELSE [table].id END, 
    @row_count = @row_count - 1 
FROM 
    @table [table] 
WHERE 
    [table].weight = @next_weight 
ORDER BY 
    [table].Weight DESC

來源

2009-01-18 01:26:24 MatBailie

+0

我做了一些經驗性測試，發現您的解決方案對輸入數據非常敏感。我的測試數據 - 權重：2，998，迭代次數：1M。重量2應該拿起約2k次。如果表中權重的順序是升序（2,998），那麼它的權重2就是大約500次。如果您反轉訂單，則約爲2500次。如果您將`ROUND`更改爲`FLOOR`，升序約爲1000次，重量降低約2000倍。這是適當的概率。我已經更新了你的答案。 – 2016-09-01 19:01:46

7

您只需要對所有準備行的權重求和，然後在該總和中選擇一個隨機點，然後選擇與該選定點協調的記錄（每個記錄遞增地攜帶累加權重和）。

DECLARE @id int, @weight_sum int, @weight_point int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT TOP 1 @id = t1.id 
FROM @table t1, @table t2 
WHERE t1.id >= t2.id 
GROUP BY t1.id 
HAVING SUM(t2.weight) >= @weight_point 
ORDER BY t1.id 

SELECT @id

來源

2008-09-12 07:45:06 Dane

+1

我做了一個小您和MatBailie解決方案的基準，看起來您的時間大約是Mat的兩倍。在2行和1千萬次迭代的桌子上，Mat的解決方案耗時約45秒，解決方案約需85秒。 – 2016-09-01 18:40:50

3

的「逐步揹着一個accumlating [原文]加權求和」部分是昂貴的，如果你有很多的記錄。如果你已經有了很大的分數/重量範圍（例如：範圍足夠寬，大多數記錄重量是獨一無二的，1-5顆星可能不會削減它），你可以做這樣的事情來選擇一個重量值。我使用VB.Net在這裏展示，但是這可以很容易地在純SQL來完成，以及：

Function PickScore() 
    'Assume we have a database wrapper class instance called SQL and seeded a PRNG already 
    'Get count of scores in database 
    Dim ScoreCount As Double = SQL.ExecuteScalar("SELECT COUNT(score) FROM [MyTable]") 
    ' You could also approximate this with just the number of records in the table, which might be faster. 

    'Random number between 0 and 1 with ScoreCount possible values 
    Dim rand As Double = Random.GetNext(ScoreCount)/ScoreCount 

    'Use the equation y = 1 - x^3 to skew results in favor of higher scores 
    ' For x between 0 and 1, y is also between 0 and 1 with a strong bias towards 1 
    rand = 1 - (rand * rand * rand) 

    'Now we need to map the (0,1] vector to [1,Maxscore]. 
    'Just find MaxScore and mutliply by rand 
    Dim MaxScore As UInteger = SQL.ExecuteScalar("SELECT MAX(Score) FROM Songs") 
    Return MaxScore * rand 
End Function

運行它，並選擇具有最大記錄得分比返回的重量更輕。如果分數超過一個記錄，請隨機挑選。這裏的優點是你不必保留任何總和，並且你可以調整適合你的口味的概率公式。但是，再次，分數越大，它的效果就越好。

來源

2008-09-12 13:41:07

2

使用隨機數生成器完成此操作的方法是集成概率密度函數。使用一組離散值可以計算前綴和（所有值的總和），並將其存儲起來。使用此選項，您可以選擇大於隨機數的最小前綴總和（聚合到日期）值。

在數據庫上插入後的後續值必須更新。如果數據集的更新和大小的相對頻率沒有使得這樣做成本過高，這意味着可以從單個s-argable（可以通過索引查找來解析的謂詞）查詢中獲取適當的值。

來源

2008-09-18 14:57:55 ConcernedOfTunbridgeWells