
From Priors to Predictions: An Intuitive Introduction to Bayesian Inference

An intuitive introduction to Bayesian inference, from Bayes' theorem and likelihoods to hands-on modelling with Gamma-distributed wait times. We show how Bayesian methods estimate full distributions over parameters, quantify uncertainty, and enable probabilistic predictions, wrapping up with a practical example using Stan.


1. Introduction

Bayesian inference is a framework for statistical inference that provides a systematic way to update our beliefs about the parameters of a model in light of observed data. At its heart is Bayes’ theorem, which relates four key components:

  • Prior distribution $p(\mathbf{x})$: Encodes our beliefs about the parameters $\mathbf{x}$ before seeing the data.
  • Likelihood $p(\mathbf{y} \mid \mathbf{x})$: Measures how well a particular setting of the parameters explains the observed data $\mathbf{y}$.
  • Marginal likelihood (evidence) $p(\mathbf{y})$: A normalising constant ensuring the posterior distribution is a valid probability distribution.
  • Posterior distribution $p(\mathbf{x} \mid \mathbf{y})$: Our updated beliefs about the parameters after observing data.

Formally, Bayes’ theorem states:

$$p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x}) \, p(\mathbf{x})}{p(\mathbf{y})},$$

where the marginal likelihood is given by

$$p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}.$$

Here, $\mathbf{x} \in \mathbb{R}^D$ represents the parameters of interest, and $\mathbf{y} \in \mathbb{R}^{D_N}$ the observed data. In many realistic problems, computing $p(\mathbf{y})$ exactly is intractable, making closed-form solutions for the posterior unavailable. In such cases, we turn to approximate methods, such as Markov Chain Monte Carlo (MCMC), to generate samples from the posterior.
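To make these pieces concrete, here is a minimal numerical sketch of Bayes' theorem for a single parameter, where the integral defining $p(\mathbf{y})$ is approximated by a sum over a grid. The Gaussian model, prior, and data below are illustrative assumptions, not a recipe:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=1.0, size=20)        # observed data (synthetic)

mu_grid = np.linspace(5.0, 15.0, 1001)              # candidate parameter values
prior = norm.pdf(mu_grid, loc=8.0, scale=2.0)       # prior belief about mu
likelihood = np.array([norm.pdf(y, loc=m, scale=1.0).prod() for m in mu_grid])

unnormalised = likelihood * prior
evidence = np.trapz(unnormalised, mu_grid)          # grid approximation of p(y)
posterior = unnormalised / evidence                 # now integrates to one over the grid

print("Posterior mean of mu:", np.trapz(mu_grid * posterior, mu_grid))

In real problems the parameter space is rarely one-dimensional, which is exactly why this integral becomes intractable and sampling methods such as MCMC take over.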


2. Likelihood Intuition

The likelihood function quantifies how plausible the observed data is under different possible parameter values (hypotheses). For continuous data, the likelihood is the value of the probability density function evaluated at the observed points. Though technically not a probability itself, it is often described informally as “the probability of the data given the parameters.”

To build intuition, suppose we believe a dataset was generated from a Gaussian distribution with known variance but unknown mean $\mu$. Each possible $\mu$ corresponds to a different hypothesis about the true data-generating process. The likelihood tells us which values of $\mu$ make the observed data most plausible. Hypotheses that align closely with the actual data yield higher likelihood values (or lower negative log-likelihoods).

In the interactive plot below, 2,000 samples are drawn from a Gaussian distribution with true mean $\mu = 10$ and variance $\sigma^2 = 1$. The slider controls our hypothesis for the mean. As you move the slider away from the true mean, the red curve shifts, and the negative log-likelihood (NLL) value updates in real time. Notice how the NLL increases the further our hypothesis moves from $\mu = 10$, indicating a progressively poorer fit to the observed data.

Gaussian likelihood explorer
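If the interactive widget is unavailable, the same comparison can be reproduced in a few lines of NumPy/SciPy. The sample size matches the plot's description; the seed and candidate means are arbitrary choices for illustration:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=1.0, size=2000)    # true mean 10, variance 1

for mu in [8.0, 9.0, 10.0, 11.0, 12.0]:              # hypotheses for the mean
    nll = -norm.logpdf(data, loc=mu, scale=1.0).sum()
    print(f"mu = {mu:4.1f}   NLL = {nll:10.2f}")

The NLL should be smallest for the hypothesis closest to the true mean, mirroring what the slider shows.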

3. Modelling Wait Times: The Maths

Suppose we record wait times $t_1, \dots, t_n > 0$ in a clinic. A natural model is the Gamma distribution with shape $k > 0$ and scale $\theta > 0$:

$$f(t \mid k, \theta) = \frac{t^{k-1} e^{-t/\theta}}{\Gamma(k) \, \theta^{k}}, \quad t > 0.$$

For independent and identically distributed (i.i.d.) observations, the log-likelihood (LL) is

$$\ell(k, \theta \mid \mathbf{t}) = (k-1) \sum_i \ln t_i - \frac{1}{\theta} \sum_i t_i - n k \ln \theta - n \ln \Gamma(k).$$

A single likelihood value on its own tells us very little. Its absolute magnitude depends on the sample size and the measurement units (for continuous data), and it can be rescaled by a constant without changing the relative support for different hypotheses. Likelihoods become informative only in comparison:

  • Likelihood ratio (LR): for two hypotheses $(k_1,\theta_1)$ vs. $(k_2,\theta_2)$,

    $$\Lambda=\frac{L(k_1,\theta_1\mid \mathbf t)}{L(k_2,\theta_2\mid \mathbf t)},\qquad \log\Lambda=\ell(k_1,\theta_1\mid \mathbf t)-\ell(k_2,\theta_2\mid \mathbf t).$$

    If $\Lambda>1$ (or $\log\Lambda>0$), the data are more compatible with $(k_1,\theta_1)$ than with $(k_2,\theta_2)$.

  • Model comparison: we often compare multiple candidate models via log-likelihood differences (LLDs) or penalized versions like the Akaike Information Criterion (AIC) / Bayesian Information Criterion (BIC) that subtract a complexity penalty.

  • Bayesian view: the likelihood shapes the posterior relative to alternatives through

    $$p(k,\theta\mid \mathbf t)\propto L(k,\theta\mid \mathbf t)\,p(k,\theta),$$

    again emphasizing that it's the contrast between parameter values or hypotheses that carries the information.

In short: the likelihood is a ranking rule over hypotheses. By itself, a large or small value doesn't mean much; differences and ratios do.
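As a concrete illustration of this ranking idea, the sketch below compares two arbitrary Gamma hypotheses on synthetic wait times via their log-likelihood ratio; the parameter values, seed, and sample size are illustrative assumptions:

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
t = rng.gamma(shape=2.0, scale=3.0, size=100)        # synthetic wait times

def gamma_loglik(k, theta, t):
    """Gamma log-likelihood: sum of log-PDF values over the observations."""
    return gamma.logpdf(t, a=k, scale=theta).sum()

# log Lambda = l(k1, theta1 | t) - l(k2, theta2 | t)
log_lambda = gamma_loglik(2.0, 3.0, t) - gamma_loglik(1.0, 5.0, t)
print("log likelihood ratio:", log_lambda)           # > 0 favours (k1, theta1)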

3.1 A Detour: Writing Likelihoods in Python

In Python, we can compute the likelihood by evaluating the probability density function (PDF) of a distribution at the observed data points. The scipy.stats module makes this straightforward.

import numpy as np
from scipy.stats import gamma

def gamma_likelihood(k, theta, data):
    """
    Compute the likelihood of observed data under a Gamma(k, theta) model.

    Parameters
    ----------
    k : float
        Shape parameter (k > 0).
    theta : float
        Scale parameter (theta > 0).
    data : array-like
        Observed values (must be positive).

    Returns
    -------
    float
        The likelihood value (product of PDF values).

    Raises
    ------
    ValueError
        If parameters or data are invalid.
    """
    # Validate parameters
    if not np.isscalar(k) or k <= 0:
        raise ValueError(f"Shape parameter k must be positive. Got {k}")
    if not np.isscalar(theta) or theta <= 0:
        raise ValueError(f"Scale parameter theta must be positive. Got {theta}")

    # Validate data
    data = np.asarray(data, dtype=float)
    if data.size == 0:
        raise ValueError("Data must contain at least one value.")
    if np.any(data <= 0):
        raise ValueError("All observed values must be positive.")
    if np.any(~np.isfinite(data)):
        raise ValueError("Data contains NaN or infinite values.")

    # Compute likelihood
    pdf_values = gamma.pdf(data, a=k, scale=theta)
    return np.prod(pdf_values)

We can then calculate the likelihood of a given dataset by calling the function with chosen parameter values and the observed data:

data = np.array([1.2, 3.4, 2.8, 4.5])
lik = gamma_likelihood(k=2.0, theta=3.0, data=data)
print("Likelihood:", lik)

This direct computation is useful for building intuition about how the likelihood changes as we vary kk and θ\theta. However, in practice, it is often more numerically stable to work with the log-likelihood rather than the raw likelihood, because products of many small numbers can underflow to zero in floating-point arithmetic.
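As a rough sketch of that log-likelihood variant, reusing the imports and the data array from above, we can sum log-PDF values instead of multiplying densities (gamma_log_likelihood is our own illustrative helper, not part of SciPy):

def gamma_log_likelihood(k, theta, data):
    """Log-likelihood under Gamma(k, theta): a sum of log-PDF values,
    which avoids the underflow that products of small densities can cause."""
    data = np.asarray(data, dtype=float)
    return gamma.logpdf(data, a=k, scale=theta).sum()

print("Log-likelihood:", gamma_log_likelihood(k=2.0, theta=3.0, data=data))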

3.2 Building Intuition: Gamma-Distributed Wait Times

The following interactive plot shows the probability density function of a Gamma distribution with shape $k = 2.0$ and scale $\theta = 3.0$. The mean wait time is $\mu = k\theta = 6.0$. You can compare how the observed wait times align with this model by overlaying their histogram.

Gamma distribution PDF
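A non-interactive version of this figure can be sketched with Matplotlib; the synthetic "observed" wait times, seed, and bin count below are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

k, theta = 2.0, 3.0
print("Mean wait time:", k * theta)                  # mu = k * theta = 6.0

rng = np.random.default_rng(1)
waits = rng.gamma(shape=k, scale=theta, size=500)    # synthetic observed waits

ts = np.linspace(0.0, 30.0, 300)
plt.hist(waits, bins=30, density=True, alpha=0.4, label="observed wait times")
plt.plot(ts, gamma.pdf(ts, a=k, scale=theta), color="red", label="Gamma(k=2, theta=3) PDF")
plt.xlabel("wait time")
plt.ylabel("density")
plt.legend()
plt.show()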

4. Sampling from the Posterior

For most interesting models, including those describing the spread of a disease through a community, the posterior distribution cannot be computed exactly. For example, if we model the number of new daily COVID-19 cases as following a Poisson distribution with an unknown infection rate $\lambda$ and place a prior on $\lambda$, the resulting posterior often has no closed-form expression. In such cases, we use sampling algorithms such as Markov Chain Monte Carlo (MCMC) to draw representative samples from the posterior.

These samples can be used to:

  • Estimate summaries of the parameters (means, credible intervals).
  • Visualise the distribution of plausible models.
  • Make probabilistic predictions for future data (e.g., how many cases we might see next week).

In the Bayesian framework, inference is therefore about learning a distribution over parameters, not just finding a single optimal set of values. This mindset shift is powerful because it explicitly acknowledges and quantifies uncertainty. Instead of producing a single forecast, we produce a range of possible futures, weighted by their plausibility given the evidence. This richer picture allows decision-makers, whether public health officials or engineers, to balance risks, plan contingencies, and update strategies as new data arrives.
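As a minimal sketch of how such samples get used, suppose lambda_draws holds posterior draws of the Poisson rate from an MCMC run (here they are simulated only so the snippet runs on its own); the stand-in distribution and the seven-day assumption are both illustrative:

import numpy as np

rng = np.random.default_rng(7)
lambda_draws = rng.gamma(shape=50.0, scale=0.5, size=4000)   # stand-in for MCMC output

# Parameter summaries
print("Posterior mean of lambda:", lambda_draws.mean())
print("95% credible interval:", np.percentile(lambda_draws, [2.5, 97.5]))

# Posterior predictive for next week's cases, assuming the daily rate
# stays constant over the next 7 days
next_week = rng.poisson(7.0 * lambda_draws)
print("Predicted weekly cases, 95% interval:", np.percentile(next_week, [2.5, 97.5]))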

4.1 In Practice: Inferring the Distribution of Wait Times at a Clinic

Below we draw synthetic wait times from a known Gamma distribution and fit the parameters $k$ and $\theta$ using Bayesian inference in the probabilistic programming language Stan. Stan lets us write the model succinctly and then sample from the posterior with state-of-the-art MCMC.
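A rough sketch of this workflow using the cmdstanpy interface is shown below; the priors, sample size, seed, and file name are assumptions for illustration rather than the exact setup behind the figure:

import numpy as np
from cmdstanpy import CmdStanModel

stan_code = """
data {
  int<lower=1> N;
  vector<lower=0>[N] t;
}
parameters {
  real<lower=0> k;            // shape
  real<lower=0> theta;        // scale
}
model {
  k ~ exponential(1);         // weakly informative priors (assumption)
  theta ~ exponential(1);
  t ~ gamma(k, 1 / theta);    // Stan parameterises the Gamma by shape and rate
}
"""

with open("gamma_wait_times.stan", "w") as f:
    f.write(stan_code)

# Synthetic wait times from a known Gamma(k = 2, theta = 3)
rng = np.random.default_rng(0)
t = rng.gamma(shape=2.0, scale=3.0, size=200)

model = CmdStanModel(stan_file="gamma_wait_times.stan")
fit = model.sample(data={"N": len(t), "t": t}, seed=1, show_progress=False)

k_draws = fit.stan_variable("k")
theta_draws = fit.stan_variable("theta")
print("Posterior mean wait time:", (k_draws * theta_draws).mean())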

We visualise:

  • an opaque histogram of the observed data (as a density),
  • many faint grey curves: Gamma PDFs computed from posterior draws $(k^{(s)}, \theta^{(s)})$; each curve is a plausible hypothesis consistent with the data and the prior,
  • a blue curve: the posterior mean PDF, obtained by averaging across samples (i.e., using $\bar{k}, \bar{\theta}$); it is not the truth, just a summary,
  • a vertical dashed line at the posterior mean wait time $\hat{\mu}=\bar{k}\,\bar{\theta}$.

The spread of the grey curves shows uncertainty: when data are scarce or noisy, the curves fan out; with more data, they concentrate around the blue summary curve.
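A sketch of this figure, assuming k_draws, theta_draws, and the synthetic data t from the fitting sketch above, might look like the following (the number of curves and the styling are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

ts = np.linspace(0.01, 30.0, 300)

# Faint grey curves: one Gamma PDF per posterior draw (a subset, for speed)
for k_s, th_s in zip(k_draws[:200], theta_draws[:200]):
    plt.plot(ts, gamma.pdf(ts, a=k_s, scale=th_s), color="grey", alpha=0.05)

# Blue curve: PDF at the posterior means (a summary, not the truth)
k_bar, th_bar = k_draws.mean(), theta_draws.mean()
plt.plot(ts, gamma.pdf(ts, a=k_bar, scale=th_bar), color="blue", linewidth=2,
         label="posterior mean PDF")

# Dashed vertical line at the posterior mean wait time
plt.axvline(k_bar * th_bar, linestyle="--", label="posterior mean wait time")

# Histogram of the observed (synthetic) wait times, as a density
plt.hist(t, bins=30, density=True, alpha=0.3, label="observed wait times")

plt.xlabel("wait time")
plt.ylabel("density")
plt.legend()
plt.show()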


5. Conclusion

Bayesian inference offers a powerful, intuitive approach to reasoning under uncertainty, uniting prior beliefs with new evidence to produce richer, more informative results than traditional point estimates. Specifically:

  • Bayesian inference provides a principled framework for updating beliefs in light of new data.
  • The likelihood function measures how well different parameter values explain observed data.
  • Interactive visualisation (Gaussian example) helps build intuition about how the likelihood changes as hypotheses shift from the truth.
  • The Gamma model is a natural choice for modelling strictly positive wait times, and fitting it to synthetic data can recover the generating parameters.
  • Posterior sampling (e.g., via MCMC) allows us to quantify uncertainty by generating a distribution over plausible parameter values.
  • Real-world scenarios, such as modelling COVID-19 spread, illustrate the advantage of Bayesian methods in forecasting and decision-making.
  • Bayesian results enable risk-aware planning by providing a range of possible futures rather than a single prediction.