25

I want to generate a random number with a given probability but I'm not sure how to:

I need a number between 1 and 3

num = ceil(rand*3);

but I need different values to have different probabilities of generating eg.

0.5 chance of 1
0.1 chance of 2
0.4 chance of 3

I'm sure this is straightforward but I can't think of how to do it.

1

7 Answers 7

46

The simple solution is to generate a number with a uniform distribution (using rand), and manipulate it a bit:

r = rand;
prob = [0.5, 0.1, 0.4];
x = sum(r >= cumsum([0, prob]));

or in a one-liner:

x = sum(rand >= cumsum([0, 0.5, 0.1, 0.4]));

Explanation

Here r is a uniformly distributed random number between 0 and 1. To generate an integer number between 1 and 3, the trick is to divide the [0, 1] range into 3 segments, where the length of each segment is proportional to its corresponding probability. In your case, you would have:

  • Segment [0, 0.5), corresponding to number 1.
  • Segment [0.5, 0.6), corresponding to number 2.
  • Segment [0.6, 1], corresponding to number 3.

The probability of r falling within any of the segments is proportional to the probabilities you want for each number. sum(r >= cumsum([0, prob])) is just a fancy way of mapping an integer number to one of the segments.

Extension

If you're interested in creating a vector/matrix of random numbers, you can use a loop or arrayfun:

r = rand(3); % # Any size you want
x = arrayfun(@(z)sum(z >= cumsum([0, prob])), r);

Of course, there's also a vectorized solution, I'm just too lazy to write it.

4
  • Thanks for your help, is there something I can do when the probabilities are [0,0,1] in this case I need the answer to be 3, but keep getting 1 ? Dec 17, 2012 at 12:42
  • 1
    PS: You forgot to add the 0 on this line -> x = sum(rand >= cumsum([0.5, 0.1, 0.4)); Dec 17, 2012 at 13:05
  • 3
    Vectorized solution: sum(bsxfun(@ge, r, cumsum([0, prob]),2) where r is a column vector and prob a row vector.
    – Oleg
    Oct 20, 2014 at 18:47
  • @OlegKomarov Thanks for the vectorized solution ;)
    – Eitan T
    Jan 12, 2015 at 16:59
9

The answers so far are correct, but slow for large inputs: O(m*n) where n is the number of values and m is the number of random samples. Here is a O(m*log(n)) version that takes advantage of monotonicity of the cumsum result and the binary search used in histc:

% assume n = numel(prob) is large and sum(prob) == 1
r = rand(m,1);
[~,x] = histc(r,cumsum([0,prob]));
3
  • Small note, depending on how histc is implemented this could be O(n+m*log(n)). Though I'd hope since the first output is not used, this isn't the case. Dec 6, 2013 at 18:23
  • 1
    There's an even better O(n + m) solution using the Alias Method. I've implemented it in the sample_discrete.m function. May 8, 2014 at 1:43
  • The link is broken, fyi.
    – kauffmanes
    Feb 7, 2019 at 2:21
5
>> c = cumsum([0.5, 0.1, 0.4]);
>> r = rand(1e5, 1);
>> x = arrayfun(@(x) find(x <= c, 1, 'first'), r);
>> h = hist(x, 1:3)

h =

       49953       10047       40000

x distributed as desired.

2
  • @EitanT, I don't think that sum() is faster than find(..., 'first'). Also, there is no need in adding zero. Please test it. In general case I would only add: assert(c(end) == 1);
    – Serg
    Dec 18, 2012 at 20:45
  • Now that I think about it, my comment is out of place. My apologies.
    – Eitan T
    Dec 18, 2012 at 21:10
5

using randsample function from Statistics and Machine Learning Toolbox, you can generate random numbers with specified probability mass function (pmf):

pmf = [0.5, 0.1, 0.4];
population = 1:3;
sample_size = 1;

random_number = randsample(population,sample_size,true,pmf);

I think this is the easiest method.

4

A slightly more general solution would be:

r=rand;
prob=[.5,.1,.4];
prob=cumsum(prob);
value=[1,2,3];    %values corresponding to the probabilities
ind=find(r<=prob,1,'first');
x=value(ind)
0
0

When the probabilities are nice numbers like this it is possible to do a very simple and performant selection. We repeat population elements such that a uniform selection yields the desired probability distribution. In this case we create a population of 10, with 5 times 1 (0.5 probability of being selected), etc.

p = [1,1,1,1,1,2,3,3,3,3];
x = p(randi(numel(p));

randi takes a second input argument that determines the size of the output (the default is 1), so it’s simple to generate many values from this distribution.

0

A vector solution using rand, cumsum, and min.

r = rand(10,1);
p = [0.5 0.1 0.4];
[~, ind] = min(r >= cumsum(p), [], 2)
  • Randomly sample r from 0..1 using rand. In this case I put my data into a column vector.
  • Put the probabilities for each output index into p.
  • r >= cumsum(p) compares every combination of r and the cumulative probabilities of p. In this case the result is a 2D matrix where each row begins with a series of 1s and ends with a series of 0s. The first 0 indicates the element of p that was randomly selected.
  • min is performed for all rows and returns the column index of the first 0. The third input to min defines the dimension over which to calculate the minimum.

If you want to extend this to n dimensions of r: change the shape of p so that it extends into one more dimension than what r has, and give that dimension as min's third input.

r = rand(3, 5, 7);
p = []; 
p(1,1,1,:) = [0.5 0.1 0.4];
[~, ind] = min(r >= cumsum(p), [], 4)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.