
Explanation: Calibration Diagnostics

Calibration asks whether predictive uncertainty is statistically consistent with realised outcomes.

A model can fit averages well and still be badly calibrated. AMMM therefore runs dedicated calibration diagnostics and writes the results to 50_diagnostics/.

The Probability Integral Transform (PIT) evaluates each observation's predictive cumulative distribution function at the observed value, giving one number between 0 and 1 per observation.

If predictive distributions are calibrated, PIT values are approximately Uniform(0,1):

  • flat histogram: good calibration,
  • U-shaped histogram: over-confident (intervals too narrow),
  • inverted-U histogram: under-confident (intervals too wide),
  • systematic shift left/right: biased location.
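The histogram shapes above can be reproduced on simulated data. The following is a minimal sketch, not AMMM's implementation: the helper name `ppc_pit` and the simulated normal data are assumptions made for illustration. It estimates each PIT value as the share of predictive draws at or below the observed value.

```python
import numpy as np


def ppc_pit(y_obs, y_rep):
    """Hypothetical helper: PIT per observation from predictive draws.

    y_obs has shape (n_obs,); y_rep has shape (n_draws, n_obs).
    Each PIT value is the fraction of draws <= the observed value.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_rep = np.asarray(y_rep, dtype=float)
    return (y_rep <= y_obs[None, :]).mean(axis=0)


rng = np.random.default_rng(0)
n_obs, n_draws = 500, 2000
y_obs = rng.normal(0.0, 1.0, size=n_obs)  # simulated "observed" data

# Calibrated predictive: same distribution as the data.
pit_good = ppc_pit(y_obs, rng.normal(0.0, 1.0, size=(n_draws, n_obs)))
# Over-confident predictive: spread too small, so PIT piles up near 0 and 1.
pit_narrow = ppc_pit(y_obs, rng.normal(0.0, 0.5, size=(n_draws, n_obs)))

counts_good, _ = np.histogram(pit_good, bins=10, range=(0.0, 1.0))
counts_narrow, _ = np.histogram(pit_narrow, bins=10, range=(0.0, 1.0))
print("calibrated:    ", counts_good)
print("over-confident:", counts_narrow)
```

The first row of counts should be roughly flat, while the second should be heavier in the outer bins than in the middle, which is the U shape described above.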

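AMMM prefers LOO-PIT (leave-one-out PIT) when a log_likelihood group is available on the InferenceData object. Each observation is then scored against a predictive distribution that excludes it, which avoids using the same observation for both fitting and calibration scoring.

If log_likelihood is unavailable, AMMM falls back to PPC-PIT, computed directly from the posterior predictive draws, and records the method used in pit_method.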
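The preference and fallback can be sketched as one function. This is an illustration under stated assumptions, not AMMM's code: the function name `pit_values` and the labels `"loo_pit"` and `"ppc_pit"` are hypothetical, and the leave-one-out weights are raw importance weights (proportional to the inverse pointwise likelihood) without the Pareto smoothing that a production LOO routine would normally apply.

```python
import numpy as np


def pit_values(y_obs, y_rep, log_lik=None):
    """Hypothetical sketch: LOO-PIT if log_lik is given, else PPC-PIT.

    y_obs: (n_obs,), y_rep: (n_draws, n_obs), log_lik: (n_draws, n_obs) or None.
    Returns the PIT values and an assumed method label.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    below = np.asarray(y_rep, dtype=float) <= y_obs[None, :]
    if log_lik is None:
        # Fallback: every draw gets equal weight.
        return below.mean(axis=0), "ppc_pit"
    # Raw leave-one-out importance weights: w_s proportional to 1 / p(y_i | theta_s).
    log_w = -np.asarray(log_lik, dtype=float)
    log_w -= log_w.max(axis=0, keepdims=True)  # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum(axis=0, keepdims=True)
    pit = np.clip((w * below).sum(axis=0), 0.0, 1.0)
    return pit, "loo_pit"


# Simulated example: normal data with known unit variance, unknown mean.
rng = np.random.default_rng(1)
n_obs, n_draws = 50, 4000
y_obs = rng.normal(0.3, 1.0, size=n_obs)
mu = rng.normal(y_obs.mean(), 1.0 / np.sqrt(n_obs), size=n_draws)  # posterior draws
y_rep = mu[:, None] + rng.normal(size=(n_draws, n_obs))
log_lik = -0.5 * (y_obs[None, :] - mu[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)

pit_loo, method_loo = pit_values(y_obs, y_rep, log_lik)
pit_ppc, method_ppc = pit_values(y_obs, y_rep)
print(method_loo, method_ppc)
```

With ArviZ, whether the group exists could be checked with something like `"log_likelihood" in idata.groups()` before deciding which branch to take.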

AMMM also uses a delta-ECDF view:

  • compare empirical PIT CDF against the uniform CDF,
  • inspect departures relative to simultaneous confidence bands.

Persistent departures outside the bands indicate material miscalibration.
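A minimal sketch of the delta-ECDF idea follows. The source does not say how AMMM constructs its simultaneous bands, so this example assumes a band from the Dvoretzky-Kiefer-Wolfowitz inequality, which bounds the largest gap between an empirical CDF and the true CDF. The function name `delta_ecdf` is hypothetical.

```python
import numpy as np


def delta_ecdf(pit, alpha=0.05):
    """Hypothetical sketch: ECDF of PIT values minus the uniform CDF.

    Returns the sorted PIT values, the difference just after each ECDF jump,
    the half-width of an assumed DKW simultaneous band, and the largest
    absolute departure from the uniform CDF.
    """
    u = np.sort(np.asarray(pit, dtype=float))
    n = u.size
    # The uniform CDF on [0, 1] is F(u) = u, so the difference is ECDF(u) - u.
    delta_right = np.arange(1, n + 1) / n - u  # ECDF just after each jump
    delta_left = np.arange(0, n) / n - u       # ECDF just before each jump
    # DKW: P(sup |ECDF - F| > eps) <= 2 * exp(-2 * n * eps**2).
    half_width = float(np.sqrt(np.log(2.0 / alpha) / (2.0 * n)))
    max_dev = float(max(np.abs(delta_right).max(), np.abs(delta_left).max()))
    return u, delta_right, half_width, max_dev


rng = np.random.default_rng(2)
pit_u_shaped = rng.beta(0.5, 0.5, size=1000)  # simulated over-confident PIT values
_, _, band, worst = delta_ecdf(pit_u_shaped)
print("departure exceeds band:", worst > band)
```

Plotting the difference against the sorted PIT values, with horizontal lines at plus and minus the half-width, gives a picture comparable to the delta-ECDF view described above.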

Coverage is checked across nominal levels (0.1 to 0.9):

  • nominal level = intended interval mass,
  • empirical coverage = realised proportion inside those intervals.

Ideal behaviour lies near the 45° line.
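The nominal-versus-empirical comparison can be sketched as follows. This example assumes central (equal-tailed) intervals built from posterior predictive draws; the source does not state which interval type AMMM uses, and the function name `coverage_table` is hypothetical.

```python
import numpy as np


def coverage_table(y_obs, y_rep, levels=None):
    """Hypothetical sketch: (nominal level, empirical coverage) pairs.

    y_obs: (n_obs,), y_rep: (n_draws, n_obs).
    Assumes central intervals, with (1 - level) / 2 mass in each tail.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_rep = np.asarray(y_rep, dtype=float)
    if levels is None:
        levels = np.linspace(0.1, 0.9, 9)
    rows = []
    for level in levels:
        lower = np.quantile(y_rep, (1.0 - level) / 2.0, axis=0)
        upper = np.quantile(y_rep, (1.0 + level) / 2.0, axis=0)
        inside = (y_obs >= lower) & (y_obs <= upper)
        rows.append((round(float(level), 2), float(inside.mean())))
    return rows


rng = np.random.default_rng(3)
n_obs, n_draws = 2000, 1000
y_obs = rng.normal(0.0, 1.0, size=n_obs)  # simulated "observed" data

table_good = coverage_table(y_obs, rng.normal(0.0, 1.0, size=(n_draws, n_obs)))
table_narrow = coverage_table(y_obs, rng.normal(0.0, 0.5, size=(n_draws, n_obs)))
for (level, good), (_, narrow) in zip(table_good, table_narrow):
    print(f"nominal={level:.1f}  calibrated={good:.3f}  over-confident={narrow:.3f}")
```

For the calibrated predictive the two columns should track each other closely, while the over-confident predictive should fall below the diagonal at every level.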

AMMM calibration reporting uses the following diagnosis labels:

  • well-calibrated
  • over-confident
  • under-confident
  • biased

The top-level machine-readable field is well_calibrated in calibration_report.json.

| Artefact | Purpose |
| --- | --- |
| 50_diagnostics/calibration_report.json | Machine-readable diagnosis, well_calibrated, pit_method, and summary metrics. |
| 50_diagnostics/calibration_coverage.csv | Nominal vs empirical coverage table. |
| 50_diagnostics/calibration_pit_histogram.png | PIT histogram with uniform reference. |
| 50_diagnostics/calibration_pit_ecdf.png | Delta-ECDF with simultaneous confidence bands. |
| 50_diagnostics/calibration_coverage.png | Coverage curve versus ideal diagonal. |

Calibration corresponds to gate g5 in the stage-gated workflow. Poor calibration does not automatically invalidate code execution, but it should reduce trust in downstream business interpretation unless resolved.
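A downstream script could treat the gate as a warning rather than a hard failure. The sketch below is an assumption about how such a check might look, not part of AMMM: the function name `check_calibration_gate` is hypothetical, the report is assumed to be a JSON object with `well_calibrated` at the top level, and the demo report contents (including the `"ppc_pit"` value) are made up for illustration.

```python
import json
import tempfile
import warnings
from pathlib import Path


def check_calibration_gate(report_path):
    """Hypothetical check: warn, but do not raise, when calibration is poor."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    ok = bool(report["well_calibrated"])
    if not ok:
        # pit_method may be absent in this assumed layout, so use a default.
        method = report.get("pit_method", "unknown")
        warnings.warn(
            f"Calibration gate not passed (pit_method={method}); "
            "treat downstream interpretation with caution."
        )
    return ok


# Demo with a made-up report written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "calibration_report.json"
    path.write_text(
        json.dumps({"well_calibrated": False, "pit_method": "ppc_pit"}),
        encoding="utf-8",
    )
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        gate_ok = check_calibration_gate(path)

print(gate_ok, len(caught))
```

In a real run the path would point at 50_diagnostics/calibration_report.json. Returning a boolean and emitting a warning mirrors the point above: execution continues, but the result is flagged.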

See Workflow Stages and Methodology.

Good calibration supports probabilistic adequacy, not causal correctness.