Calibration Assessment (`src.diagnostics.calibration`)

`run_calibration_assessment` evaluates predictive calibration from posterior predictive draws and writes diagnostic plots and a machine-readable calibration report to `50_diagnostics/`. It prefers LOO-PIT when the log-likelihood is available.

```python
from src.diagnostics.calibration import run_calibration_assessment

run_calibration_assessment(
    model: Any,
    X_train: pd.DataFrame,
    y_train: pd.Series | np.ndarray,
    config: dict[str, Any],
    results_dir: str,
) -> dict[str, Any]
```

| Parameter | Description |
| --- | --- |
| `model` | Fitted model exposing `idata`; posterior predictive samples can be generated if they are missing. |
| `X_train` | Training feature matrix (used when generating posterior predictive samples). |
| `y_train` | Training target array/series. |
| `config` | Run configuration dictionary. |
| `results_dir` | Run root directory; outputs are saved in `50_diagnostics/`. |

| Filename | Stage folder | Description |
| --- | --- | --- |
| `calibration_report.json` | `50_diagnostics/` | Machine-readable summary including `well_calibrated`, `diagnosis`, `pit_method`. |
| `calibration_coverage.csv` | `50_diagnostics/` | Nominal vs empirical coverage table with deviation. |
| `calibration_pit_histogram.png` | `50_diagnostics/` | PIT histogram against the ideal uniform density. |
| `calibration_pit_ecdf.png` | `50_diagnostics/` | Delta-ECDF with ~99% confidence bands. |
| `calibration_coverage.png` | `50_diagnostics/` | Empirical vs nominal coverage plot (identity line as reference). |
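
To make the coverage table concrete, here is a minimal sketch of how nominal vs empirical coverage can be computed from posterior predictive draws. It is not the module's implementation: the function name `coverage_table`, the column names `nominal`, `empirical`, `deviation`, and the chosen interval levels are assumptions for illustration, and the data are simulated.

```python
import numpy as np


def coverage_table(y, draws, levels=(0.5, 0.8, 0.9, 0.95)):
    """Nominal vs empirical coverage of central predictive intervals.

    y has shape (n_obs,); draws has shape (n_draws, n_obs).
    Column names are hypothetical, not taken from calibration_coverage.csv.
    """
    y = np.asarray(y, dtype=float)
    draws = np.asarray(draws, dtype=float)
    rows = []
    for level in levels:
        alpha = 1.0 - level
        # Central interval per observation, taken across the draw axis.
        lower = np.quantile(draws, alpha / 2.0, axis=0)
        upper = np.quantile(draws, 1.0 - alpha / 2.0, axis=0)
        empirical = float(np.mean((y >= lower) & (y <= upper)))
        rows.append(
            {"nominal": level, "empirical": empirical, "deviation": empirical - level}
        )
    return rows


# Simulated, well-calibrated case: observations and draws share one distribution.
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
draws = rng.normal(size=(1000, 2000))
table = coverage_table(y, draws)
for row in table:
    print(row)
```

A negative deviation means the intervals cover fewer observations than advertised (too narrow); a positive deviation means they cover more (too wide).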

The function selects the PIT strategy in this order:

  1. `loo_pit` when `idata.log_likelihood` exists.
  2. `ppc_pit` when log-likelihood is absent.
  3. `ppc_pit` (fallback) if `az.loo_pit(...)` fails at runtime.

This choice is recorded in `calibration_report.json` as `pit_method`.
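
The selection order above can be sketched as a small dispatcher. This is an illustration under stated assumptions, not the module's code: `select_pit` and its callable arguments are hypothetical, and the exact label strings written to `pit_method` (in particular for the fallback case) are assumed from the list above.

```python
from typing import Callable, Sequence, Tuple

PitValues = Sequence[float]


def select_pit(
    has_log_likelihood: bool,
    loo_pit_fn: Callable[[], PitValues],
    ppc_pit_fn: Callable[[], PitValues],
) -> Tuple[PitValues, str]:
    """Return PIT values and a label describing how they were obtained.

    Hypothetical helper: the label strings are assumptions.
    """
    if has_log_likelihood:
        try:
            return loo_pit_fn(), "loo_pit"
        except Exception:
            # LOO-PIT failed at runtime, so fall back to the PPC-based PIT.
            return ppc_pit_fn(), "ppc_pit (fallback)"
    return ppc_pit_fn(), "ppc_pit"


def failing_loo_pit() -> PitValues:
    raise RuntimeError("simulated LOO-PIT failure")


def fake_ppc_pit() -> PitValues:
    return [0.1, 0.5, 0.9]


_, method = select_pit(True, failing_loo_pit, fake_ppc_pit)
print(method)  # ppc_pit (fallback)
```

Passing the two strategies as callables keeps the fallback logic testable without a fitted model.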

The returned `diagnosis` is derived by rule from PIT moments and coverage deviation:

| Pattern | Reported diagnosis |
| --- | --- |
| Near-uniform PIT and small coverage error | well-calibrated |
| Mean PIT shift | biased (systematic location shift) |
| Empirical intervals too narrow | over-confident (predictions too certain) or over-confident (intervals too narrow) |
| Empirical intervals too wide | under-confident (predictions too uncertain) or under-confident (intervals too wide) |
| Missing finite PIT values | insufficient data |
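
A rule set of this kind might look like the sketch below. The thresholds (`mean_tol`, `cov_tol`), the order of the checks, and the function name `diagnose` are assumptions made for illustration; the source does not state the actual cut-offs. Here `coverage_deviation` is taken to mean empirical minus nominal coverage, so negative values indicate intervals that are too narrow.

```python
import numpy as np


def diagnose(pit, coverage_deviation, mean_tol=0.05, cov_tol=0.05):
    """Map PIT values and a coverage deviation to a diagnosis string.

    Hypothetical rules: thresholds and check order are assumptions.
    """
    pit = np.asarray(pit, dtype=float)
    pit = pit[np.isfinite(pit)]
    if pit.size == 0:
        return "insufficient data"
    # A uniform PIT has mean 0.5; a shifted mean signals location bias.
    if abs(pit.mean() - 0.5) > mean_tol:
        return "biased (systematic location shift)"
    if coverage_deviation < -cov_tol:
        return "over-confident (intervals too narrow)"
    if coverage_deviation > cov_tol:
        return "under-confident (intervals too wide)"
    return "well-calibrated"


uniform_pit = np.linspace(0.01, 0.99, 99)
print(diagnose(uniform_pit, 0.0))
print(diagnose(uniform_pit, -0.2))
print(diagnose(np.full(50, 0.8), 0.0))
print(diagnose([np.nan], 0.0))
```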

Interpretation shortcut:

  • U-shaped PIT tends to indicate over-confidence.
  • Hump-shaped PIT tends to indicate under-confidence.
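
The two shapes are easy to reproduce with simulated data. The sketch below computes an empirical PPC-style PIT (the fraction of predictive draws at or below each observation) for a predictive distribution that is too narrow and one that is too wide. The helper name `ppc_pit` and all numbers are illustrative assumptions, not the module's internals.

```python
import numpy as np


def ppc_pit(y, draws):
    """Empirical PIT: share of predictive draws at or below each observation.

    y has shape (n_obs,); draws has shape (n_draws, n_obs).
    """
    return np.mean(np.asarray(draws) <= np.asarray(y), axis=0)


rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=3000)                 # true spread: sd = 1
narrow = rng.normal(0.0, 0.5, size=(500, y.size))   # over-confident predictions
wide = rng.normal(0.0, 2.0, size=(500, y.size))     # under-confident predictions

pit_narrow = ppc_pit(y, narrow)
pit_wide = ppc_pit(y, wide)

for name, pit in [("narrow", pit_narrow), ("wide", pit_wide)]:
    counts, _ = np.histogram(pit, bins=5, range=(0.0, 1.0))
    # A uniform PIT has variance 1/12 (about 0.083).
    print(name, counts.tolist(), round(float(pit.var()), 3))
```

With the narrow predictive distribution, observations fall in the tails of the draws, so PIT values pile up near 0 and 1 (U-shape, variance above 1/12). With the wide one, PIT values cluster around 0.5 (hump, variance below 1/12).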

If the model uses in-graph target scaling, calibration computations attempt to inverse-transform predictions via the scaling strategy resolver, so that plots and coverage are evaluated on the original target scale.
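
As an illustration of why the inverse transform matters, the sketch below assumes a simple standardization (subtract mean, divide by standard deviation) and maps scaled predictive draws back before computing a 90% interval. The scaling form, the helper `inverse_standardize`, and the values of `y_mean` and `y_std` are all hypothetical; the actual scaling strategy resolver is not shown in this page.

```python
import numpy as np


def inverse_standardize(values, mean, std):
    """Undo z-score scaling (assumed scaling form, for illustration only)."""
    return np.asarray(values, dtype=float) * std + mean


y_mean, y_std = 120.0, 15.0  # hypothetical scaling parameters
rng = np.random.default_rng(2)
draws_scaled = rng.normal(size=(400, 50))          # draws on the scaled target
y_original = rng.normal(y_mean, y_std, size=50)    # observations, original scale

# Bring the draws back to the original scale before comparing with y.
draws_original = inverse_standardize(draws_scaled, y_mean, y_std)
lower, upper = np.quantile(draws_original, [0.05, 0.95], axis=0)
coverage_90 = float(np.mean((y_original >= lower) & (y_original <= upper)))
print(round(coverage_90, 2))
```

Comparing original-scale observations against scaled draws would make the intervals meaningless, which is why the transform has to be undone first.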

```python
from src.diagnostics.calibration import run_calibration_assessment

cal_report = run_calibration_assessment(
    model=driver.model,
    X_train=driver.X_train,
    y_train=driver.y_train,
    config=driver.config,
    results_dir=driver.results_dir,
)
print(cal_report["well_calibrated"], cal_report["diagnosis"], cal_report.get("pit_method"))
```

  • Stage: `50_diagnostics/`.
  • Feeds calibration gate g5 via `calibration_report.json`.
  • `well_calibrated` is the machine-readable calibration outcome consumed downstream for interpretation risk signalling.
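
A downstream consumer could read the report along these lines. This is a sketch, not the gate's implementation: `load_calibration_report` and `calibration_gate_passes` are hypothetical helpers, the logic of gate g5 is assumed to reduce to the `well_calibrated` flag, and the report written in the demo is fabricated sample content used only to exercise the code.

```python
import json
import tempfile
from pathlib import Path


def load_calibration_report(results_dir):
    """Read calibration_report.json from the 50_diagnostics/ stage folder."""
    path = Path(results_dir) / "50_diagnostics" / "calibration_report.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def calibration_gate_passes(report):
    """Hypothetical gate rule: pass only when well_calibrated is truthy."""
    return bool(report.get("well_calibrated", False))


# Demo with a throwaway run directory and a made-up report.
with tempfile.TemporaryDirectory() as run_root:
    stage = Path(run_root) / "50_diagnostics"
    stage.mkdir()
    sample = {"well_calibrated": True, "diagnosis": "well-calibrated", "pit_method": "loo_pit"}
    (stage / "calibration_report.json").write_text(json.dumps(sample), encoding="utf-8")

    report = load_calibration_report(run_root)
    gate_ok = calibration_gate_passes(report)
    print(gate_ok, report["diagnosis"], report.get("pit_method"))
```

Treating a missing `well_calibrated` key as a failure is a conservative choice made here; the real gate may behave differently.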