Pre-Diagnostics Guide
Overview
The pre-diagnostics module provides automated validation of MMM inputs before model fitting. These tests help identify potential data quality issues that could affect model reliability.
Pre-diagnostics run automatically by default as part of the standard AMMM pipeline, during the DATA EXPLORATION phase.
What is Tested
1. Stationarity Tests (ADF + KPSS)
Purpose: Assess whether the dependent variable (target) exhibits stationarity or a unit root.
Why it matters: Non-stationary time series can lead to spurious correlations and unreliable inference in regression models.
Tests performed:
- Augmented Dickey-Fuller (ADF): Tests the null hypothesis of a unit root
- Kwiatkowski-Phillips-Schmidt-Shin (KPSS): Tests the null hypothesis of stationarity
Interpretation:
| ADF Result | KPSS Result | Conclusion |
|---|---|---|
| Reject H₀ (p < 0.05) | Fail to reject H₀ (p ≥ 0.05) | Likely stationary |
| Fail to reject H₀ (p ≥ 0.05) | Reject H₀ (p < 0.05) | Likely unit root |
| Other combinations | Other combinations | Inconclusive |
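The decision table above can be sketched as a small function. This is a minimal illustration, not the module's actual code: the function name `classify_stationarity` and the exact conclusion strings are assumptions, while the boolean keys mirror the `adf_stationary` and `kpss_nonstationary` columns described under Output Files. The p-values would typically come from an ADF and a KPSS routine (for example the ones in statsmodels); they are passed in directly here so the sketch has no dependencies.

```python
from typing import Dict, Union


def classify_stationarity(
    adf_pvalue: float, kpss_pvalue: float, alpha: float = 0.05
) -> Dict[str, Union[bool, str]]:
    """Combine ADF and KPSS p-values following the interpretation table.

    ADF null hypothesis: unit root. KPSS null hypothesis: stationarity.
    The function name and the conclusion strings are hypothetical.
    """
    adf_stationary = adf_pvalue < alpha        # ADF rejects the unit root
    kpss_nonstationary = kpss_pvalue < alpha   # KPSS rejects stationarity

    if adf_stationary and not kpss_nonstationary:
        conclusion = "likely stationary"
    elif not adf_stationary and kpss_nonstationary:
        conclusion = "likely unit root"
    else:
        # Both reject or both fail to reject: the tests disagree.
        conclusion = "inconclusive"

    return {
        "adf_stationary": adf_stationary,
        "kpss_nonstationary": kpss_nonstationary,
        "stationarity_conclusion": conclusion,
    }


print(classify_stationarity(adf_pvalue=0.01, kpss_pvalue=0.10))
print(classify_stationarity(adf_pvalue=0.40, kpss_pvalue=0.01))
```

Note that the two "other" combinations mean different things in practice: both tests rejecting often points to a trend or structural break, while neither rejecting often points to a short or noisy series.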
Remediation if unit root detected:
- First differencing: Δy_t = y_t - y_{t-1}
- Detrending: Remove linear or polynomial trends
- Log transformation: For multiplicative trends
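As a rough sketch of these three remedies on a plain list of target values (the helper names are illustrative and not part of the module; detrending is shown for the linear case only):

```python
import math
from typing import List, Sequence


def first_difference(y: Sequence[float]) -> List[float]:
    """Return y_t - y_{t-1}; the result is one element shorter than y."""
    return [curr - prev for prev, curr in zip(y, y[1:])]


def log_transform(y: Sequence[float]) -> List[float]:
    """Natural log of each value; only defined for strictly positive data."""
    if any(v <= 0 for v in y):
        raise ValueError("log transformation requires strictly positive values")
    return [math.log(v) for v in y]


def linear_detrend(y: Sequence[float]) -> List[float]:
    """Subtract the least-squares line fitted against t = 0, 1, ..., n - 1."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [v - (intercept + slope * t) for t, v in enumerate(y)]


sales = [100.0, 104.0, 109.0, 115.0, 122.0]  # made-up values for illustration
print(first_difference(sales))
print(linear_detrend(sales))
print(first_difference(log_transform(sales)))  # approximate growth rates
```

Whichever transformation is applied, the stationarity tests should be rerun on the transformed target, and the model output has to be interpreted on the transformed scale.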
Note: By design, only the target variable is tested for stationarity. There is no requirement for regressors to be stationary in typical MMM applications.
2. Variance Inflation Factor (VIF)
Purpose: Detect multicollinearity among regressors (media spend channels + control variables).
Why it matters: High multicollinearity inflates coefficient variance, making it difficult to isolate individual channel effects.
Interpretation:
| VIF Value | Severity | Action |
|---|---|---|
| VIF < 5 | Low multicollinearity | No action needed |
| 5 ≤ VIF < 10 | Moderate multicollinearity | Monitor closely |
| VIF ≥ 10 | High multicollinearity | Flagged - consider remediation |
Remediation if high VIF detected:
- Remove or combine highly correlated channels
- Principal Component Analysis (PCA) on correlated features
- Ridge regression or other regularisation techniques
- Domain knowledge to select most important variables
Additional metrics:
- Tolerance (1/VIF): Lower values indicate higher multicollinearity
- Correlation matrix (max): Highest pairwise correlation for each variable
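For reference, the VIF of regressor j is 1 / (1 − R²_j), where R²_j comes from regressing that regressor on all the others (with an intercept). The sketch below implements this definition in plain Python and applies the severity bands from the table above. It is an illustration under stated assumptions, not the module's implementation: the function names are hypothetical, the toy data is made up, and constant or perfectly collinear columns are not handled.

```python
from typing import Dict, List, Sequence


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def _r_squared(y: Sequence[float], predictors: List[Sequence[float]]) -> float:
    """R-squared of an OLS fit of y on the predictors plus an intercept."""
    n = len(y)
    design = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = _solve(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, row)) for row in design]
    y_mean = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif_table(columns: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, object]]:
    """VIF, tolerance and severity band for every regressor."""
    out: Dict[str, Dict[str, object]] = {}
    for name, values in columns.items():
        others = [v for other, v in columns.items() if other != name]
        r2 = _r_squared(values, others)
        vif = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
        severity = "low" if vif < 5 else "moderate" if vif < 10 else "high"
        out[name] = {"vif": vif, "tolerance": 1.0 / vif, "severity": severity}
    return out


# Toy data: digital_spend tracks tv_spend closely, price is unrelated to both.
regressors = {
    "tv_spend": [1, 2, 3, 4, 5, 6, 7, 8],
    "digital_spend": [1.1, 1.9, 3.2, 3.8, 5.1, 5.9, 7.2, 7.8],
    "price": [11, 11, 9, 9, 9, 9, 11, 11],
}
for name, row in vif_table(regressors).items():
    print(name, round(row["vif"], 2), row["severity"])
```

In this toy data the two spend columns are flagged as high while price stays low, which is the typical signature of two channels that were planned and flighted together.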
3. Transfer Entropy
Purpose: Detect directional information flow between media channels (X) and the target variable (Y).
Why it matters: Transfer entropy provides a non-linear, model-free measure of predictive relationships, complementing traditional correlation analysis.
What is computed:
- TE(X→Y): Information flow from channel X to target Y
- TE(Y→X): Information flow from target Y to channel X
- p-values: Statistical significance via permutation test (200 permutations by default)
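To make these quantities concrete, here is a minimal pairwise transfer entropy sketch. It is not the module's implementation and rests on several assumptions: equal-width binning, a history length of one time step, a plug-in (frequency count) estimator, and a p-value computed as (k + 1) / (n + 1), where k is the number of shuffled-source estimates at least as large as the observed one. All function names are hypothetical.

```python
import math
import random
from collections import Counter
from typing import List, Sequence, Tuple


def discretise(values: Sequence[float], bins: int = 8) -> List[int]:
    """Map values to equal-width bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: everything falls into one bin
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def transfer_entropy(source: Sequence[int], target: Sequence[int]) -> float:
    """Plug-in TE(source -> target) in bits, conditioning on one past step."""
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_full = Counter(triples)                              # (y_next, y_prev, x_prev)
    c_prev = Counter((yp, xp) for _, yp, xp in triples)    # (y_prev, x_prev)
    c_self = Counter((yn, yp) for yn, yp, _ in triples)    # (y_next, y_prev)
    c_y = Counter(yp for _, yp, _ in triples)              # y_prev
    te = 0.0
    for (yn, yp, xp), count in c_full.items():
        p_given_both = count / c_prev[(yp, xp)]            # p(y_next | y_prev, x_prev)
        p_given_self = c_self[(yn, yp)] / c_y[yp]          # p(y_next | y_prev)
        te += (count / n) * math.log2(p_given_both / p_given_self)
    return te


def permutation_p_value(
    source: Sequence[int], target: Sequence[int], permutations: int = 200, seed: int = 0
) -> Tuple[float, float]:
    """Observed TE and its p-value against a shuffled-source null."""
    rng = random.Random(seed)
    observed = transfer_entropy(source, target)
    shuffled = list(source)
    exceed = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # destroys the temporal link between source and target
        if transfer_entropy(shuffled, target) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (permutations + 1)


# Synthetic example: y is x delayed by one step, so information flows x -> y.
rng = random.Random(42)
x = [rng.random() for _ in range(500)]
y = [x[0]] + x[:-1]
xb, yb = discretise(x, bins=4), discretise(y, bins=4)
print("TE(X->Y), p:", permutation_p_value(xb, yb))
print("TE(Y->X), p:", permutation_p_value(yb, xb))
```

Plug-in estimates are biased upwards on short series, and the bias grows with the number of bins. This is why the permutation p-value, rather than the raw TE value, should drive any conclusion.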
Direction classification:
| Condition | Direction | Interpretation |
|---|---|---|
| TE(X→Y) significant AND TE(X→Y) > TE(Y→X) | x→y | X likely predicts Y |
| TE(Y→X) significant AND TE(Y→X) > TE(X→Y) | y→x | Y likely predicts X (reverse causality?) |
| Both significant | bidirectional | Mutual predictive relationship |
| Neither significant | none | No strong directional relationship |
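The classification rules can be sketched as follows. The table does not state which row wins when several conditions hold at once, so this sketch assumes that "both significant" is checked first and that any case not covered by the table falls back to `none`. The function name is hypothetical; the returned labels follow the Direction column.

```python
def classify_direction(
    te_x_to_y: float,
    te_y_to_x: float,
    p_x_to_y: float,
    p_y_to_x: float,
    alpha: float = 0.05,
) -> str:
    """Return 'x→y', 'y→x', 'bidirectional' or 'none'.

    Assumption: the bidirectional rule takes precedence over the
    one-directional rules; the real module may order them differently.
    """
    significant_x_to_y = p_x_to_y < alpha
    significant_y_to_x = p_y_to_x < alpha

    if significant_x_to_y and significant_y_to_x:
        return "bidirectional"
    if significant_x_to_y and te_x_to_y > te_y_to_x:
        return "x→y"
    if significant_y_to_x and te_y_to_x > te_x_to_y:
        return "y→x"
    return "none"


print(classify_direction(0.12, 0.03, p_x_to_y=0.01, p_y_to_x=0.40))
print(classify_direction(0.02, 0.09, p_x_to_y=0.60, p_y_to_x=0.02))
```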
Important Caveats:
⚠️ This implementation uses pairwise (unconditional) transfer entropy
- Does NOT control for confounding variables
- Cannot establish true causality
- May detect spurious relationships due to common drivers
⚠️ Interpretation guidance:
- Use TE as an exploratory tool, not confirmatory evidence
- Significant TE(X→Y) suggests X may have predictive value for Y
- Always combine with domain knowledge and theoretical understanding
- For rigorous causal analysis, consider conditional TE or structural models
Optional: Include control variables in TE analysis by setting te_include_controls_in_x=True in the orchestrator function.
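Putting the three tests side by side, the sketch below shows how the columns for each diagnostic could be derived from the config dictionary used in the Programmatic Usage example. The helper `columns_from_config` is hypothetical and only restates the scoping rules described in this guide: the target alone for stationarity, media plus controls for VIF, and media (optionally plus controls) as X for transfer entropy.

```python
from typing import Any, Dict, List


def columns_from_config(
    config: Dict[str, Any], te_include_controls_in_x: bool = False
) -> Dict[str, List[str]]:
    """Hypothetical helper: which columns each pre-diagnostic would look at."""
    media_cols = [channel["spend_col"] for channel in config.get("media", [])]
    control_cols = list(config.get("extra_features_cols", []))
    te_x = media_cols + control_cols if te_include_controls_in_x else list(media_cols)
    return {
        "stationarity": [config["target_col"]],  # target only, by design
        "vif": media_cols + control_cols,        # all regressors
        "te_x": te_x,                            # candidate predictors
        "te_y": [config["target_col"]],          # always the target
    }


config = {
    "date_col": "date",
    "target_col": "sales",
    "media": [
        {"display_name": "TV", "spend_col": "tv_spend"},
        {"display_name": "Digital", "spend_col": "digital_spend"},
    ],
    "extra_features_cols": ["price", "competitor_activity"],
}
print(columns_from_config(config))
print(columns_from_config(config, te_include_controls_in_x=True)["te_x"])
```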
Output Files
All diagnostics save results to results/csv/.
For the complete specification of each CSV (column names and meanings), see the Reference Output Files page:
- {ref}`stationarity_summary.csv <stationarity_summarycsv>`
- {ref}`vif_summary.csv <vif_summarycsv>`
- {ref}`transfer_entropy_summary.csv <transfer_entropy_summarycsv>`
1. stationarity_summary.csv
| Column | Description |
|---|---|
| `variable` | Variable name (target column) |
| `adf_stat` | ADF test statistic |
| `adf_pvalue` | ADF p-value |
| `adf_usedlag` | Number of lags used in ADF test |
| `adf_nobs` | Number of observations used |
| `kpss_stat` | KPSS test statistic |
| `kpss_pvalue` | KPSS p-value |
| `kpss_lags` | Number of lags used in KPSS test |
| `adf_stationary` | Boolean: ADF rejects unit root (p < 0.05) |
| `kpss_nonstationary` | Boolean: KPSS rejects stationarity (p < 0.05) |
| `stationarity_conclusion` | Combined interpretation |

See reference: {ref}`stationarity_summary.csv <stationarity_summarycsv>`
2. vif_summary.csv
| Column | Description |
|---|---|
| `variable` | Variable name |
| `vif` | Variance Inflation Factor |
| `tolerance` | 1 / VIF |
| `corr_max` | Maximum absolute pairwise correlation |
| `flag_high_vif` | Boolean: VIF > 10 |

See reference: {ref}`vif_summary.csv <vif_summarycsv>`
3. transfer_entropy_summary.csv
| Column | Description |
|---|---|
| `variable` | Predictor variable name |
| `te_x_to_y` | Transfer entropy from X to Y |
| `te_y_to_x` | Transfer entropy from Y to X |
| `p_x_to_y` | p-value for X→Y |
| `p_y_to_x` | p-value for Y→X |
| `significant_x_to_y` | Boolean: p_x_to_y < 0.05 |
| `significant_y_to_x` | Boolean: p_y_to_x < 0.05 |
| `direction` | Directional classification |

See reference: {ref}`transfer_entropy_summary.csv <transfer_entropy_summarycsv>`
Quick read example
```python
import pandas as pd

stationarity = pd.read_csv('results/csv/stationarity_summary.csv')
vif = pd.read_csv('results/csv/vif_summary.csv')
te = pd.read_csv('results/csv/transfer_entropy_summary.csv')

print(stationarity.head())
print(vif.sort_values('vif', ascending=False).head())
print(te.head())
```

Integration
Automatic Execution
Pre-diagnostics run automatically when you execute:

```bash
python runme.py
```

The diagnostics execute during the DATA EXPLORATION phase, after media spend visualisations and before model fitting.
Programmatic Usage
You can also run diagnostics independently:

```python
from src.diagnostics.pre_diagnostics import run_all_pre_diagnostics
import pandas as pd

# Load your data and config
data = pd.read_csv('your_data.csv')
config = {
    'date_col': 'date',
    'target_col': 'sales',
    'media': [
        {'display_name': 'TV', 'spend_col': 'tv_spend'},
        {'display_name': 'Digital', 'spend_col': 'digital_spend'}
    ],
    'extra_features_cols': ['price', 'competitor_activity']
}

# Run all diagnostics
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results'
)

# Print saved file paths
for filename, path in result_paths.items():
    print(f"{filename}: {path}")
```

Individual Tests
You can run tests individually for more control:

```python
from src.diagnostics.pre_diagnostics import (
    run_stationarity_tests,
    run_vif_tests,
    run_transfer_entropy
)

# Stationarity test on target only
stationarity_df = run_stationarity_tests(
    data=data,
    date_col='date',
    cols=['sales']
)

# VIF test on regressors
vif_df = run_vif_tests(
    data=data,
    cols=['tv_spend', 'digital_spend', 'price']
)

# Transfer entropy
te_df = run_transfer_entropy(
    data=data,
    date_col='date',
    x_cols=['tv_spend', 'digital_spend'],
    y_col='sales',
    permutations=200  # Configurable
)
```

Advanced Configuration
```python
# Include controls in transfer entropy analysis
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results',
    te_include_controls_in_x=True,  # Test controls → target
    te_kwargs={'permutations': 500, 'bins': 10}  # Custom TE settings
)

# Custom stationarity test settings
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results',
    stationarity_kwargs={
        'adf_regression': 'ct',  # Include trend in ADF
        'kpss_regression': 'ct'  # Include trend in KPSS
    }
)
```

Error Handling
The pre-diagnostics module is designed to be non-fatal:
- If a diagnostic fails, it writes error information to the CSV
- The pipeline continues with model fitting
- Warnings are logged for non-critical issues (e.g., constant series, insufficient data)
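The non-fatal pattern described above can be illustrated with a small wrapper. This is a sketch of the general idea only: the function `run_non_fatal` and the `diagnostic` / `error` column names are hypothetical, and the actual layout of error rows in the module's CSVs may differ.

```python
import csv
import logging
import os
import tempfile
from typing import Callable, Dict, List

logger = logging.getLogger("pre_diagnostics_sketch")


def run_non_fatal(
    name: str, func: Callable[[], List[Dict[str, object]]], csv_path: str
) -> str:
    """Run one diagnostic; on failure, write the error to the CSV instead of raising."""
    try:
        rows = func()
        if not rows:
            rows = [{"diagnostic": name, "error": "no rows returned"}]
    except Exception as exc:  # deliberately broad: diagnostics must not stop the pipeline
        logger.warning("Diagnostic %s failed: %s", name, exc)
        rows = [{"diagnostic": name, "error": f"{type(exc).__name__}: {exc}"}]

    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return csv_path


def failing_vif() -> List[Dict[str, object]]:
    """Stand-in for a diagnostic that hits a constant series."""
    raise ValueError("constant series: tv_spend")


out_dir = tempfile.mkdtemp()
path = run_non_fatal("vif", failing_vif, os.path.join(out_dir, "vif_summary.csv"))
with open(path, newline="") as handle:
    print(handle.read())
# Execution reaches this point, so model fitting could proceed.
```

Because failures end up in the CSV rather than on the console, it is worth checking the output files for error rows before relying on them.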
Performance Considerations
- Transfer Entropy is the most computationally expensive test
- Default settings: 200 permutations, 8 bins
- For large datasets or many channels, consider:
  - Reducing permutations (min 50 for exploratory analysis)
  - Running TE separately on a subset of channels
  - Using fewer bins for discretisation
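A back-of-the-envelope calculation shows why the permutation count dominates the cost. The sketch assumes that each direction of each channel gets its own permutation run and that p-values are computed as (k + 1) / (n + 1); neither detail is confirmed by this guide, and both helper names are illustrative.

```python
def te_evaluations(channels: int, permutations: int = 200, directions: int = 2) -> int:
    """Number of TE estimates: one observed plus one per permutation, per direction."""
    return channels * directions * (permutations + 1)


def smallest_p_value(permutations: int) -> float:
    """Smallest p-value attainable under the (k + 1) / (n + 1) convention."""
    return 1.0 / (permutations + 1)


print(te_evaluations(channels=7))                   # 2814 with the defaults
print(te_evaluations(channels=7, permutations=50))  # 714
print(smallest_p_value(50))                         # about 0.0196, still below 0.05
```

Under these assumptions, dropping from 200 to 50 permutations cuts the work roughly fourfold while still allowing significance at α = 0.05, at the price of coarser p-values.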
Typical runtime for demo data (~80 weeks, 7 channels):
- Stationarity: <1 second
- VIF: <1 second
- Transfer Entropy: 10-30 seconds
Best Practices
- Always review stationarity results: Non-stationary targets can invalidate regression assumptions
- Flag high VIF early: Multicollinearity issues are easier to address before model fitting
- Use TE as exploratory tool: Complement with domain knowledge and economic theory
- Document findings: Keep notes on which diagnostics flagged issues and how you addressed them
- Iterate: Run diagnostics again after data transformations or feature engineering
References
- Stationarity: Dickey & Fuller (1979); Kwiatkowski et al. (1992)
- VIF: Marquardt (1970); O’Brien (2007)
- Transfer Entropy: Schreiber (2000); Bossomaier et al. (2016)
Limitations and Future Extensions
Current limitations:
- TE is pairwise (unconditional) only
- No automatic remediation suggestions
- Fixed significance threshold (α = 0.05)
Planned extensions:
- Conditional transfer entropy (control for confounders)
- Multivariate TE
- Automated data transformation recommendations
- Time-varying diagnostics (rolling window analysis)
For questions or issues, please consult the main AMMM documentation or raise an issue on GitHub.