Cumulative distribution plots in Python


I am working on a project in Python where I have two arrays of data. Let's call them pc and pnc. I need to plot a cumulative distribution of both on the same graph. For pc it should be a "less than" plot, i.e. at (x, y), y points in pc have a value less than x. For pnc it should be a "more than" plot, i.e. at (x, y), y points in pnc have a value greater than x.

I have tried using the histogram function pyplot.hist. Is there a better and easier way to do what I want? Also, the x-axis has to be on a logarithmic scale.

You were close. Instead of plt.hist, use numpy.histogram: it gives you both the counts and the bin edges, and then you can plot the cumulative with ease:

import numpy as np
import matplotlib.pyplot as plt

# some fake data
data = np.random.randn(1000)
# evaluate the histogram
values, base = np.histogram(data, bins=40)
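# evaluate the cumulative; cumulative[i] counts the points up to the right edge of bin i
cumulative = np.cumsum(values)
# plot the cumulative function against the right bin edges
plt.plot(base[1:], cumulative, c='blue')
# plot the survival function (points beyond each right bin edge)
plt.plot(base[1:], len(data)-cumulative, c='green')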

plt.show()
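The question also asks for a logarithmic x-axis and for two arrays, which the snippet above does not show. Here is a minimal sketch of one way to do it with the same np.histogram idea, assuming both arrays contain only strictly positive values. The names cumulative_on_log_bins, plot_less_and_more_than and n_bins are hypothetical, chosen for illustration.

```python
import numpy as np


def cumulative_on_log_bins(data, n_bins=40):
    """Return log-spaced bin edges with the counts left and right of each edge."""
    data = np.asarray(data, dtype=float)
    # Log-spaced edges only make sense for strictly positive data (assumption)
    edges = np.logspace(np.log10(data.min()), np.log10(data.max()), n_bins + 1)
    # Pin the end points so rounding in logspace cannot push a sample outside the bins
    edges[0], edges[-1] = data.min(), data.max()
    counts, _ = np.histogram(data, bins=edges)
    # below[i]: number of samples in the bins to the left of edges[i]
    below = np.concatenate(([0], np.cumsum(counts)))
    # above[i]: number of samples in the bins to the right of edges[i]
    above = data.size - below
    return edges, below, above


def plot_less_and_more_than(pc, pnc, n_bins=40):
    """Draw the 'less than' curve for pc and the 'more than' curve for pnc."""
    import matplotlib.pyplot as plt  # imported here so the helper above needs only NumPy

    edges_pc, below_pc, _ = cumulative_on_log_bins(pc, n_bins)
    edges_pnc, _, above_pnc = cumulative_on_log_bins(pnc, n_bins)
    plt.plot(edges_pc, below_pc, label="pc: count below x")
    plt.plot(edges_pnc, above_pnc, label="pnc: count above x")
    plt.xscale("log")
    plt.xlabel("x")
    plt.ylabel("Count")
    plt.legend()
    plt.show()
```

Calling plot_less_and_more_than(pc, pnc) with the two arrays from the question should draw both curves on one log-scaled axis. As with any binned approach, the counts are only exact at the bin edges.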


Using histograms is really unnecessarily heavy and imprecise (the binning makes the data fuzzy): you can just sort all the x values: the index of each value is the number of values that are smaller. This shorter and simpler solution looks like this:

import numpy as np
import matplotlib.pyplot as plt

# Some fake data:
data = np.random.randn(1000)

sorted_data = np.sort(data)  # Or data.sort(), if data can be modified

# Cumulative counts:
plt.step(sorted_data, np.arange(sorted_data.size))  # From 0 to the number of data points-1
plt.step(sorted_data[::-1], np.arange(sorted_data.size))  # From the number of data points-1 to 0

plt.show()

Furthermore, a more appropriate plot style is indeed plt.step() instead of plt.plot(), since the data is in discrete locations.

The result is: (figure not shown)

You can see that it is more ragged than the output of EnricoGiampieri's answer, but this one is the real histogram (instead of being an approximate, fuzzier version of it).

PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count-1). This can be achieved with:

plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
         np.arange(sorted_data.size+1))
plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
         np.arange(sorted_data.size+1))

There are so many points in data that the effect is not visible without a zoom, but the very last point at the total count does matter when the data contains only a few points.
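When proportions are more useful than raw counts, the same sorting idea gives the empirical CDF directly. The following is a minimal sketch, assuming a non-empty one-dimensional input without repeated values; the function names ecdf and survival are hypothetical.

```python
import numpy as np


def ecdf(data):
    """Return the sorted values and the fraction of points <= each value."""
    x = np.sort(np.asarray(data, dtype=float))
    # The i-th sorted value (0-based) has i + 1 points at or below it
    y = np.arange(1, x.size + 1) / x.size
    return x, y


def survival(data):
    """Return the sorted values and the fraction of points >= each value."""
    x = np.sort(np.asarray(data, dtype=float))
    # The i-th sorted value (0-based) has x.size - i points at or above it
    y = (x.size - np.arange(x.size)) / x.size
    return x, y
```

Both results can be passed straight to plt.step(x, y, where='post'), and plt.xscale('log') can be added if the data is strictly positive.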


After conclusive discussion with @EOL, I wanted to post my solution (upper left) using a random Gaussian sample as a summary:

import numpy as np
import matplotlib.pyplot as plt
from math import ceil, floor, sqrt

def pdf(x, mu=0, sigma=1):
    """
    Calculates the normal distribution's probability density 
    function (PDF).  

    """
    term1 = 1.0 / ( sqrt(2*np.pi) * sigma )
    term2 = np.exp( -0.5 * ( (x-mu)/sigma )**2 )
    return term1 * term2


# Drawing sample data points
##################################################

# Random Gaussian data (mean=0, stdev=5)
data1 = np.random.normal(loc=0, scale=5.0, size=30)
data2 = np.random.normal(loc=2, scale=7.0, size=30)
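data1.sort()
data2.sort()

# data1+data2 would add the arrays element-wise, so take the extremes of each array instead
min_val = floor(min(data1.min(), data2.min()))
max_val = ceil(max(data1.max(), data2.max()))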

##################################################




fig = plt.gcf()
fig.set_size_inches(12,11)

# Cumulative distributions, stepwise:
plt.subplot(2,2,1)
plt.step(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label=r'$\mu=0, \sigma=5$')
plt.step(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label=r'$\mu=2, \sigma=7$')

plt.title('30 samples from a random Gaussian distribution (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()

# Cumulative distributions, smooth:
plt.subplot(2,2,2)

plt.plot(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size+1), label=r'$\mu=0, \sigma=5$')
plt.plot(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size+1), label=r'$\mu=2, \sigma=7$')

plt.title('30 samples from a random Gaussian (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()


# Probability densities evaluated at the sample points
plt.subplot(2,2,3)

pdf1 = pdf(data1, mu=0, sigma=5)
pdf2 = pdf(data2, mu=2, sigma=7)
plt.plot(data1, pdf1, label=r'$\mu=0, \sigma=5$')
plt.plot(data2, pdf2, label=r'$\mu=2, \sigma=7$')

plt.title('30 samples from a random Gaussian')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()


# Probability density function
plt.subplot(2,2,4)

x = np.arange(min_val, max_val, 0.05)

pdf1 = pdf(x, mu=0, sigma=5)
pdf2 = pdf(x, mu=2, sigma=7)
plt.plot(x, pdf1, label=r'$\mu=0, \sigma=5$')
plt.plot(x, pdf2, label=r'$\mu=2, \sigma=7$')

plt.title('PDFs of Gaussian distributions')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()

plt.show()


To add my own contribution to the community, here is my function for plotting histograms. This is how I understood the question: plotting the histogram and the cumulative histogram at the same time:

import numpy as np
import matplotlib.pyplot as plt

def hist(data, bins, title, labels, range=None):
  fig = plt.figure(figsize=(15, 8))
  ax = plt.axes()
  # density=True replaces normed=True, which recent Matplotlib releases no longer accept
  values, base, _ = ax.hist(data, bins=bins, density=True, alpha=0.5, color="green", range=range, label="Histogram")
  ax.set_xlabel(labels)
  ax.set_ylabel("Proportion")
  ax.set_title(title)
  # Second y-axis, sharing the x-axis, for the cumulative curve
  ax_bis = ax.twinx()
  # Prepend 0 so there is one cumulative value per bin edge, running from 0 to 1
  cumulative = np.cumsum(np.insert(values, 0, 0))
  ax_bis.plot(base, cumulative / cumulative[-1], color='darkorange', marker='o', linestyle='-', markersize=1, label="Cumulative Histogram")
  ax_bis.set_ylabel("Proportion")
  ax_bis.legend()
  ax.legend()
  plt.show()

If anyone wonders what it looks like, this is the output with seaborn activated: (figure not shown)


Comments
  • It'd help if you showed your attempts so far - sample input data, desired output etc... Otherwise this reads as a "show me the code" question
  • To extend Jon's comment, people are much happier to help you fix the code you have rather than to generate code from scratch. No matter how buggy and non-functional your code is, show it and explain what a) you expect it to do and b) what it is currently doing.
  • FYI, you forgot to include the np before the cumsum as your np.histogram command implies is needed.
  • @ehsteve fixed answer.
  • Using a histogram is both unnecessarily heavy and imprecise.
  • @EOL but necessary for large arrays else you'll run out of memory.
  • Indeed, but I take that this is not the particular case of the question, which is more about how to get the cumulative distribution than to do it in the case of a large array, and approximately.
  • However for large arrays you want to go with the histogram approach as it doesn't require nearly as much memory. The plt.step method gives me a memory error with my 60 million element array.
  • Agreed. I'm not sure whether the problem lies with plt.step or with the fact that this exact method uses maybe 3 times the memory of the array, or both…
  • I agree: plt.step is probably the more appropriate approach for plotting "counts". One question: wouldn't you have to use plt.step(sorted_data, np.arange(1, data.size+1)) to get the correct counts?
  • @SebastianRaschka: Good point. You are correct. A perfect solution would add this last point. This can be done by duplicating the last abscissa and adding the total count (5) at the last ordinate. I updated the answer, thanks!
  • Thanks for the update. Your workaround looks definitely nicer than mine :)
  • If you expect negative values in your array, you probably want to take the absolute value; otherwise the cumulative histogram will look off.
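
The memory concern raised in the comments can be eased, as a sketch, by computing exact cumulative counts only at a fixed grid of x values with np.searchsorted, so that far fewer points are handed to the plotting call. The name cumulative_counts_on_grid and the n_points default are hypothetical, and whether this is enough for an array of 60 million elements has not been measured here.

```python
import numpy as np


def cumulative_counts_on_grid(data, n_points=1000):
    """Return a grid of x values and the exact number of samples <= each of them."""
    # np.sort makes one copy; calling data.sort() instead sorts in place and avoids it
    sorted_data = np.sort(np.asarray(data))
    # Evenly spaced grid spanning the data range
    grid = np.linspace(sorted_data[0], sorted_data[-1], n_points)
    # side="right" counts the samples that are less than or equal to each grid value
    counts = np.searchsorted(sorted_data, grid, side="right")
    return grid, counts
```

Unlike a histogram, the counts at the grid points are exact; only the detail between grid points is lost. The reverse cumulative count is sorted_data.size - counts.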