1. Introduction
In many real-world scenarios, we cannot always observe the full outcome of a random variable. Instead, we only know that the true value exceeds a certain threshold, this is known as right-censoring.
Let’s consider the following example:
Suppose we’re studying the time people spend waiting in a clinic’s waiting room. We collect data on how long each patient waits. However, due to resource constraints, some observations are censored.
In this scenario:
- Some patients leave early or are no longer tracked after a certain time.
- We might stop recording after a maximum allowed time.
This means for some individuals we observe their exact wait time, and for others, we only know they waited at least some threshold time.
Our goal is to infer the underlying distribution of wait times using both the observed and right-censored data.
The transparent histogram shows the full (uncensored) Gamma-distributed wait times, while the solid grey histogram shows the observed data after censoring. The two vertical lines represent the aggressive and passive censoring thresholds. Notice how much of the right tail of the true distribution is lost due to censoring. This illustrates the challenge: we must reconstruct the full distribution from partial data.
2. Modelling Wait Times: The Maths
We assume that the true wait times, , follow a Gamma distribution:
,
where:
- is the shape parameter,
- is the rate parameter (note: Stan uses the rate, not scale).
The dataset is split into:
- Observed data: such that (no censoring)
- Censored data: We only observe the threshold , knowing
The total likelihood becomes:
The second term uses the complementary CDF (CCDF), which Stan provides as gamma_lccdf
.
3. Simulating Censored Data in Python
We simulate samples from a Gamma distribution with known parameters , . Each sample is censored with a randomly selected threshold, drawn from two options:
- Aggressive censoring: threshold
- Passive censoring: threshold
Each data point has a 50% chance of being censored using either threshold.
samples = np.random.gamma(shape=alpha, scale=1/beta, size=N)
use_aggressive = np.random.rand(N) < p_aggressive
thresholds = np.where(use_aggressive, T_aggressive, T_passive)
is_censored = samples > thresholds
This gives us:
complete
: values wherecensored
: censoring thresholds where
The data is saved as a JSON file for Stan.
4. Writing the Stan Model
Stan supports censored data via the *_lccdf
functions. Here, we use gamma_lpdf
for observed values and gamma_lccdf
for censored values:
data {
int<lower=0> N_obs;
int<lower=0> N_cens;
vector<lower=0>[N_obs] obs;
vector<lower=0>[N_cens] cens;
}
parameters {
real<lower=0> alpha;
real<lower=0> beta;
}
model {
target += gamma_lpdf(obs | alpha, beta);
target += gamma_lccdf(cens | alpha, beta);
}
5. Running the Model with CmdStanPy
5.1 Compile the Model
from cmdstanpy import CmdStanModel
model = CmdStanModel(stan_file="models/censored_gamma.stan")
5.2: Simulate and Save the Data
Run the Python script to generate data and save to JSON:
simulate_censored_gamma(...)
prepare_stan_data(...)
5.3: Fit the Model
fit = model.sample(
data="data/gamma_data.json",
output_dir="output/",
chains=4,
parallel_chains=4,
iter_warmup=1000,
iter_sampling=1000,
seed=42
)
6.Visualising Posterior Estimates
We extract samples of alpha
and beta
, compute the posterior mean , and overlay posterior predictive densities.
6.1 Posterior PDF Plot
The grey histogram shows the observed (non-censored) wait times. The blue dashed line shows the posterior mean PDF, with the light blue shaded area representing the 95% credible interval over PDFs sampled from the posterior. The green line shows the true Gamma distribution used to simulate the data. Vertical dashed lines highlight:
- The estimated posterior mean ± 1 standard deviation (blue),
- The censoring thresholds (red and orange).
This visual confirms that the model can recover the underlying distribution, even though it only saw partially observed data.
7. Conclusion
This tutorial walked through the full pipeline of:
- Simulating censored data from a known distribution
- Encoding censoring logic in Stan using
lccdf
- Fitting the model with
CmdStanPy
- Visualising the posterior and checking recovery of true parameters
This technique applies widely in survival analysis, industrial reliability, and social science studies where data is frequently incomplete due to observation limits.