Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is just as performant (according to my tests), and perhaps easier to visualize.
Let's recap. The original question can perhaps be generalized as follows:
Given an map of ids and relative weights, create a query that returns a random id in the map, but with a probability proportional to its relative weight.
Note the emphasis on relative weights, not percent. As Branko points out in his answer, using relative weights will work for anything, including percents.
Now, consider some test data, which we'll put in a temporary table:
CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
) AS test(id, weight);
Note that I'm using a more complicated example than that in the original question, in that it does not conveniently add up to 100, and in that the same weight (20) is used more than once (for ids 2 and 3), which is important to consider, as you'll see later.
The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weights)):
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
This will result in the following output:
id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
1 | 25 | 0.5 | 0.0 | 0.5
2 | 10 | 0.2 | 0.5 | 0.7
3 | 10 | 0.2 | 0.7 | 0.9
4 | 5 | 0.1 | 0.9 | 1.0
The query above is admittedly doing more work than strictly necessary for our needs, but I find it helpful to visualize the relative probabilities this way, and it does make the final step of choosing the id trivial:
SELECT id FROM (queryabove)
WHERE random() BETWEEN startprobability AND endprobability;
Now, let's put it all together with a test that ensures the query is returning data with the expected distribution. We'll use generate_series()
to generate a random number a million times:
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
),
fp AS ( -- final probability
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
)
SELECT *
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
;
This will result in output similar to the following:
id | count
----+--------
1 | 499679
3 | 200652
2 | 199334
4 | 100335
Which, as you can see, tracks the expected distribution perfectly.
Performance
The query above is quite performant. Even in my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast:
count | time (ms)
-----------+----------
1,000 | 7
10,000 | 25
100,000 | 210
1,000,000 | 1950
Adaptation to generate test data
I often use a variation of the query above when generating test data for unit/integration tests. The idea is to generate random data that approximates a probability distribution that tracks reality.
In that situation I find it useful to compute the start and end distributions once and storing the results in a table:
CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
),
p AS ( -- probability
SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
I can then use these precomputed probabilities repeatedly, which results in extra performance and simpler use.
I can even wrap it all in a function that I can call any time I want to get a random id:
CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
SELECT id
FROM test
WHERE p_random BETWEEN startprobability AND endprobability
;
$$
LANGUAGE SQL STABLE STRICT
Window function frames
It's worth noting that the technique above is using a window function with a non-standard frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
. This is necessary to deal with the fact that some weights might be repeated, which is why I chose test data with repeated weights in the first place!