30

I want to plot an approximation of the probability density function based on a sample that I have: a curve that mimics the histogram's behaviour. I can draw samples as large as I want.

7
  • What is your sample? Is it a distribution, or actual data?
    – askewchan
    Mar 14, 2013 at 16:58
  • 1
    I don't understand how could somebody vote down this question?! I mean based on what???
    – Cupitor
    Mar 15, 2013 at 14:27
  • 2
    usually on Stack Overflow people will upvote questions that are immediately clear and also show some attempt by the asker to answer their own question. "What have you tried?" Usually downvotes are accompanied by comments though, so I'm not sure why that didn't happen in this case.
    – askewchan
    Mar 15, 2013 at 15:27
  • I see. Thanks for explanation... Sometimes these things make me think democracy sucks!
    – Cupitor
    Mar 15, 2013 at 15:53
  • heh, yeah. the faq are pretty useful for outlining what people expect to be (and not to be) in a question. And aside from 'reputation' more upvotes will make your questions get more visibility and attention.
    – askewchan
    Mar 15, 2013 at 16:03

2 Answers

43

If you want to plot a distribution and you know its form, define it as a function and plot it like so:

import numpy as np
from matplotlib import pyplot as plt

def my_dist(x):
    return np.exp(-x ** 2)

x = np.linspace(-5, 5, 200)  # fine grid; integer steps over [-100, 100) show only a spike at 0
p = my_dist(x)
plt.plot(x, p)
plt.show()

If you don't have the exact distribution as an analytical function, perhaps you can generate a large sample, take a histogram and somehow smooth the data:

import numpy as np
from scipy.interpolate import UnivariateSpline
from matplotlib import pyplot as plt

N = 1000
n = N//10
s = np.random.normal(size=N)   # generate your data sample with N elements
p, x = np.histogram(s, bins=n) # bin it into n = N//10 bins
x = x[:-1] + (x[1] - x[0])/2   # convert bin edges to centers
f = UnivariateSpline(x, p, s=n)
plt.plot(x, f(x))
plt.show()

You can increase or decrease s (the smoothing factor) in the UnivariateSpline call to increase or decrease the smoothing. For example, plotting the two approaches above gives: (figure: dist to func)
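Note that np.histogram returns raw counts by default, so the spline above follows the shape of the density but not its scale. Here is a minimal sketch, assuming you want a curve that integrates to roughly 1: it passes density=True and compares two values of s. The variable names and the particular s values are illustrative choices, not part of the original answer.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)  # stand-in for your own sample

# density=True rescales the counts so the histogram integrates to 1
density, edges = np.histogram(sample, bins=100, density=True)
centers = (edges[:-1] + edges[1:]) / 2  # bin edges -> bin centers

# s=0 interpolates every bin exactly; a larger s allows more smoothing
rough = UnivariateSpline(centers, density, s=0.0)
smooth = UnivariateSpline(centers, density, s=0.1)

# area under the histogram, and how far each spline strays from the bins
hist_area = np.sum(density * np.diff(edges))
rough_residual = rough.get_residual()
smooth_residual = smooth.get_residual()
```

Plotting centers against smooth(centers) with plt.plot then gives a curve on the same scale as a probability density. A good value of s depends on the sample size and the number of bins, so expect to tune it.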

9
  • that doesn't help in my case. I already wrote my sampling function and it is not exact for samples of size one lets say!
    – Cupitor
    Mar 14, 2013 at 17:04
  • Then I think you should edit your question to be more clear. This answers your question assuming you "have the distribution".
    – askewchan
    Mar 14, 2013 at 17:05
  • 1
    @Naji Sorry about that, it should work now, with a working example of a normal distribution.
    – askewchan
    Mar 14, 2013 at 17:30
  • 1
    you should use n =int( N/10) to avoid error from float type
    – Ajay Ohri
    Feb 19, 2018 at 8:51
  • 1
    Good point @Ajay, I should update this! When I wrote this five years ago, n was an int because I was using python 2, and most of the audience probably was too.
    – askewchan
    Feb 20, 2018 at 1:23
29

What you have to do is use gaussian_kde from scipy.stats (older SciPy versions also exposed it through the scipy.stats.kde module).

Given your data, you can do something like this:

from scipy.stats import gaussian_kde
from numpy import linspace
from numpy.random import randn
from matplotlib import pyplot as plt
# create fake data
data = randn(1000)
# this creates the kernel; given an array, it estimates the probability density over those values
kde = gaussian_kde(data)
# these are the values over which your kernel will be evaluated
dist_space = linspace(min(data), max(data), 100)
# plot the results
plt.plot(dist_space, kde(dist_space))
plt.show()

The kernel density estimate can be configured at will and can handle N-dimensional data with ease. It also avoids the spline distortion that you can see in the plot given by askewchan.
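As a rough illustration of what "configured" can mean here, the sketch below varies the bandwidth through the bw_method argument of gaussian_kde and also fits a 2-D sample. The specific bandwidth values and variable names are assumptions chosen for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # stand-in for your own sample

kde_default = gaussian_kde(data)                 # bandwidth from Scott's rule
kde_narrow = gaussian_kde(data, bw_method=0.1)   # scalar sets the bandwidth factor
kde_wide = gaussian_kde(data, bw_method=1.0)     # heavier smoothing, flatter curve

grid = np.linspace(data.min(), data.max(), 200)
curves = {
    "default": kde_default(grid),
    "narrow": kde_narrow(grid),
    "wide": kde_wide(grid),
}

# the estimate is a proper density: its area over a wide interval is ~1
area = kde_default.integrate_box_1d(-10, 10)

# N-dimensional data is passed with shape (n_dims, n_points)
data_2d = rng.normal(size=(2, 500))
kde_2d = gaussian_kde(data_2d)
density_at_origin = kde_2d(np.zeros((2, 1)))
```

Each entry of curves can be drawn with plt.plot(grid, curves[name]). A smaller factor follows the sample more closely and a larger one gives a smoother, flatter curve.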


2
  • I am looking for a similar solution. I already have a data set, but I do not know what distribution it follows, so I am trying to plot a probability distribution function using Python and I don't know how to do that. Any help is appreciated.
    – Sitz Blogz
    Mar 16, 2016 at 6:44
  • 2
    @SitzBlogz Let's say your data set is called data; then just remove the line data = randn(1000) in @EnricoGiampieri's answer and you're done!
    Aug 4, 2016 at 10:09
