Guide: Cross-Validation and Predictive Evaluation

This guide explains how to evaluate AMMM V2 predictive performance using leave-one-out (LOO) cross-validation diagnostics and hold-out checks.

What This Guide Covers

Running and interpreting PSIS-LOO
Comparing alternative model specifications
Hold-out validation workflow
Stage artefacts to inspect

1. LOO in AMMM V2

AMMM V2 writes LOO-related artefacts during workflow execution, provided the fitted model's InferenceData contains a log_likelihood group.

Check:

50_diagnostics/ELPD.txt
50_diagnostics/ELPD_summary.csv
50_diagnostics/pareto_k_summary.json
50_diagnostics/pareto_k.png

Interpretation:

Higher elpd_loo is better, but only compare values from models fit on the same dataset.
pareto_k > 0.7 for an observation means the PSIS approximation for that point is unreliable, so the elpd_loo estimate may be biased.
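
The 0.7 threshold above can be applied to a list of per-observation Pareto k values to see how many points are affected and which ones. The following is a minimal sketch: the function name, the returned keys, and the example values are hypothetical and are not part of AMMM V2 or of the pareto_k_summary.json schema.

```python
from typing import Dict, List, Sequence


def summarise_pareto_k(k_values: Sequence[float], threshold: float = 0.7) -> Dict[str, object]:
    """Count observations whose Pareto k is strictly above the threshold."""
    flagged: List[int] = [i for i, k in enumerate(k_values) if k > threshold]
    n_obs = len(k_values)
    return {
        "n_obs": n_obs,
        "n_flagged": len(flagged),
        # Guard against an empty input so the fraction is always defined.
        "fraction_flagged": len(flagged) / n_obs if n_obs else 0.0,
        "flagged_indices": flagged,
    }


# Hypothetical k values, for illustration only.
summary = summarise_pareto_k([0.12, 0.45, 0.71, 0.30, 1.05])
print(summary["n_flagged"], summary["flagged_indices"])
```

A handful of flagged observations usually points at specific influential periods worth inspecting; a large flagged fraction suggests the LOO estimate as a whole should not be relied on.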

2. Manual LOO Calculation (Notebook/Script)

import arviz as az
from pathlib import Path

from src.core.mmm_model_v2 import DelayedSaturatedMMMv2

model_path = Path("results/20_model_fit/model.nc")
if not model_path.exists():
    raise FileNotFoundError("Expected results/20_model_fit/model.nc")

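# Load the fitted model and pull out its InferenceData.
model = DelayedSaturatedMMMv2.load(str(model_path))
idata = getattr(model, "idata", None)
if idata is None:
    raise RuntimeError("Loaded model has no idata.")

# az.loo requires a log_likelihood group on idata.
loo_results = az.loo(idata)
print(loo_results)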
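
Because LOO depends on the log_likelihood group, a small guard before the LOO call can give a clearer error than a failure inside ArviZ. The sketch below is duck-typed and uses stand-in objects so that it runs without a fitted model; has_log_likelihood is a hypothetical helper name, not an AMMM V2 or ArviZ function.

```python
from types import SimpleNamespace


def has_log_likelihood(idata: object) -> bool:
    """Return True if the object exposes a non-None log_likelihood attribute."""
    return getattr(idata, "log_likelihood", None) is not None


# Stand-in objects for illustration; a real run would pass model.idata.
with_ll = SimpleNamespace(posterior={}, log_likelihood={"y": [0.0]})
without_ll = SimpleNamespace(posterior={})

print(has_log_likelihood(with_ll), has_log_likelihood(without_ll))
```

In a script, the check would sit just before the LOO call and raise a RuntimeError when it returns False.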

3. Comparing Two Models

import arviz as az

from src.core.mmm_model_v2 import DelayedSaturatedMMMv2

# Load both fitted models; each must carry a log_likelihood group.
model_a = DelayedSaturatedMMMv2.load("results_model_a/20_model_fit/model.nc")
model_b = DelayedSaturatedMMMv2.load("results_model_b/20_model_fit/model.nc")

# Rank the models by elpd_loo and plot the comparison.
cmp = az.compare({"model_a": model_a.idata, "model_b": model_b.idata})
print(cmp)
az.plot_compare(cmp)

Use this only when both models were fit on the same target and time window; elpd_loo values computed over different observations are not comparable.
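
When reading the comparison table, the size of the elpd difference matters only relative to its standard error (the elpd_diff and dse columns in ArviZ's output). The sketch below applies a common rule of thumb: treat a difference as clear only if it exceeds about two standard errors. The multiple of 2 is a heuristic assumed here, not an AMMM V2 rule, and the function name and numbers are hypothetical.

```python
def difference_is_clear(elpd_diff: float, dse: float, n_se: float = 2.0) -> bool:
    """Return True if |elpd_diff| exceeds n_se standard errors of the difference."""
    if dse < 0:
        raise ValueError("dse must be non-negative")
    return abs(elpd_diff) > n_se * dse


# Hypothetical values, for illustration only.
print(difference_is_clear(elpd_diff=12.4, dse=3.1))  # 12.4 vs 2 * 3.1 = 6.2
print(difference_is_clear(elpd_diff=2.0, dse=3.1))   # 2.0 vs 6.2
```

If the difference is not clear, prefer the simpler or more interpretable specification rather than the one that happens to rank first.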

4. Hold-Out Validation

Set train_test_ratio < 1.0 in config, then run the pipeline.

Inspect:

30_model_assessment/model_fit_predictions.png
30_model_assessment/model_fit_metrics.csv

These summarise in-sample and (when available) out-of-sample fit behaviour.
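
To make the hold-out idea concrete, the sketch below splits a series chronologically and scores predictions on the held-out tail. It assumes that train_test_ratio is the fraction of observations used for training and that the split is chronological; AMMM V2's actual splitting logic and the columns of model_fit_metrics.csv are not confirmed here. All function names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def chronological_split(values: Sequence[float], train_test_ratio: float) -> Tuple[List[float], List[float]]:
    """Split a time-ordered series into a leading train part and a trailing test part."""
    if not 0.0 < train_test_ratio <= 1.0:
        raise ValueError("train_test_ratio must be in (0, 1]")
    cut = int(round(len(values) * train_test_ratio))
    return list(values[:cut]), list(values[cut:])


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error as a fraction; actual values must be non-zero."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly target values, for illustration only.
series = [100.0, 110.0, 105.0, 120.0, 115.0, 130.0, 125.0, 140.0, 100.0, 200.0]
train, test = chronological_split(series, train_test_ratio=0.8)

# Hypothetical out-of-sample predictions for the two held-out points.
predictions = [110.0, 190.0]
print(len(train), len(test), rmse(test, predictions), mape(test, predictions))
```

Comparing the same metrics on the train and test portions shows how much fit quality degrades out of sample; a large gap is a sign of overfitting.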

5. Validation Checks Before Business Use

50_diagnostics/convergence_report.json -> converged
50_diagnostics/calibration_report.json -> well_calibrated
50_diagnostics/pareto_k_summary.json -> ok

If these checks fail, treat decomposition/optimisation outputs as provisional.
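
The three checks can be gathered into a single gate. The sketch below assumes each report is a JSON object with a top-level boolean key named as shown after the arrow (converged, well_calibrated, ok); that schema is inferred from the list above and is not confirmed. The function names are hypothetical. Missing files or keys count as failures.

```python
import json
from pathlib import Path
from typing import Dict

# Report file -> assumed top-level boolean key (an assumption, see lead-in).
CHECKS = {
    "convergence_report.json": "converged",
    "calibration_report.json": "well_calibrated",
    "pareto_k_summary.json": "ok",
}


def read_validation_flags(diagnostics_dir: Path) -> Dict[str, bool]:
    """Read each report and return its flag; missing files or keys count as False."""
    flags: Dict[str, bool] = {}
    for filename, key in CHECKS.items():
        path = diagnostics_dir / filename
        if not path.exists():
            flags[key] = False
            continue
        with path.open(encoding="utf-8") as fh:
            report = json.load(fh)
        flags[key] = isinstance(report, dict) and bool(report.get(key, False))
    return flags


def ready_for_business_use(diagnostics_dir: Path) -> bool:
    """Return True only if every validation flag is True."""
    return all(read_validation_flags(diagnostics_dir).values())


# Example call; returns False if the directory or any report is missing.
print(ready_for_business_use(Path("results/50_diagnostics")))
```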

Note: good predictive diagnostics do not imply causal validity.