Pre-Diagnostics Guide

Overview

The pre-diagnostics module provides automated validation of marketing mix model (MMM) inputs before model fitting. These tests help identify data quality issues that could affect model reliability.

By default, pre-diagnostics run automatically as part of the standard AMMM pipeline, during the DATA EXPLORATION phase.

What is Tested

1. Stationarity Tests (ADF + KPSS)

Purpose: Assess whether the dependent variable (target) is stationary or has a unit root.

Why it matters: Non-stationary time series can lead to spurious correlations and unreliable inference in regression models.

Tests performed:

Augmented Dickey-Fuller (ADF): Tests the null hypothesis of a unit root
Kwiatkowski-Phillips-Schmidt-Shin (KPSS): Tests the null hypothesis of stationarity

Interpretation:

ADF Result	KPSS Result	Conclusion
Reject H₀ (p < 0.05)	Fail to reject H₀ (p ≥ 0.05)	Likely stationary
Fail to reject H₀ (p ≥ 0.05)	Reject H₀ (p < 0.05)	Likely unit root
Other combinations	Other combinations	Inconclusive
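
The decision table above can be expressed as a small function. This is a minimal sketch, not the module's own code: the function name stationarity_conclusion and the returned strings are assumptions taken from the table, and the exact labels written to the `stationarity_conclusion` column may differ.

```python
def stationarity_conclusion(adf_pvalue: float, kpss_pvalue: float, alpha: float = 0.05) -> str:
    """Combine ADF and KPSS p-values following the interpretation table."""
    adf_rejects_unit_root = adf_pvalue < alpha        # ADF H0: unit root
    kpss_rejects_stationarity = kpss_pvalue < alpha   # KPSS H0: stationarity

    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "Likely stationary"
    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        return "Likely unit root"
    return "Inconclusive"  # both reject, or neither rejects


print(stationarity_conclusion(0.01, 0.10))  # Likely stationary
print(stationarity_conclusion(0.40, 0.01))  # Likely unit root
print(stationarity_conclusion(0.01, 0.01))  # Inconclusive
```

The two tests have opposite null hypotheses, which is why agreement requires one to reject and the other not to.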

Remediation if unit root detected:

First differencing: Δy_t = y_t - y_{t-1}
Detrending: Remove linear or polynomial trends
Log transformation: For multiplicative trends
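
As an illustration of the first and third remedies, the sketch below differences a series and, for strictly positive data, differences its logarithm. The helper names first_difference and log_difference are hypothetical and are not part of the pre-diagnostics module; the sample values are made up.

```python
import math


def first_difference(y):
    """Return Δy_t = y_t - y_{t-1}; the result is one element shorter than y."""
    return [curr - prev for prev, curr in zip(y, y[1:])]


def log_difference(y):
    """Difference of logs, for strictly positive series with multiplicative growth."""
    return first_difference([math.log(v) for v in y])


sales = [100.0, 110.0, 121.0, 133.1]  # made-up series growing 10% per period

print(first_difference(sales))  # differences still grow over time
print(log_difference(sales))    # roughly constant, close to log(1.1)
```

With a pandas Series, `Series.diff()` computes the same first difference, leaving NaN in the first position. After any transformation, re-run the stationarity tests on the transformed target.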

Note: By design, only the target variable is tested for stationarity. There is no requirement for regressors to be stationary in typical MMM applications.

2. Variance Inflation Factor (VIF)

Purpose: Detect multicollinearity among regressors (media spend channels + control variables).

Why it matters: High multicollinearity inflates coefficient variance, making it difficult to isolate individual channel effects.

Interpretation:

VIF Value	Severity	Action
VIF < 5	Low multicollinearity	No action needed
5 ≤ VIF < 10	Moderate multicollinearity	Monitor closely
VIF ≥ 10	High multicollinearity	Flagged - consider remediation
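
For reference, the VIF of regressor j equals 1 / (1 - R_j²), where R_j² comes from regressing that regressor on all the others. Equivalently, it is the j-th diagonal element of the inverse correlation matrix. The sketch below uses the second form with NumPy and labels each value using the thresholds in the table. The names compute_vif and vif_severity are hypothetical, the data are synthetic, and the module's own run_vif_tests may compute VIF differently.

```python
import numpy as np


def compute_vif(X: np.ndarray) -> np.ndarray:
    """Return one VIF per column of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)   # correlation matrix of the regressors
    return np.diag(np.linalg.inv(corr))   # diagonal of its inverse gives the VIFs


def vif_severity(vif: float) -> str:
    """Label a VIF value using the thresholds from the table above."""
    if vif >= 10:
        return "High multicollinearity"
    if vif >= 5:
        return "Moderate multicollinearity"
    return "Low multicollinearity"


# Synthetic example: digital spend closely tracks TV spend, price is independent
rng = np.random.default_rng(0)
tv = rng.normal(size=200)
digital = 0.95 * tv + 0.1 * rng.normal(size=200)
price = rng.normal(size=200)

X = np.column_stack([tv, digital, price])
vifs = compute_vif(X)
for name, vif in zip(["tv_spend", "digital_spend", "price"], vifs):
    print(f"{name}: VIF={vif:.1f} ({vif_severity(vif)})")
```

If two columns are perfectly collinear, the correlation matrix is singular and the inversion either fails or returns extremely large values, which is itself a sign that one of the columns should be dropped or combined.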

Remediation if high VIF detected:

Remove or combine highly correlated channels
Principal Component Analysis (PCA) on correlated features
Ridge regression or other regularisation techniques
Domain knowledge to select most important variables

Additional metrics:

Tolerance (1/VIF): Lower values indicate higher multicollinearity
Correlation matrix (max): Highest absolute pairwise correlation between each variable and any other regressor
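
Both additional metrics follow directly from the regressors' correlation matrix. The sketch below is an assumption about how they could be computed with NumPy, not the module's implementation; tolerance_and_corr_max is a hypothetical helper and the data are synthetic.

```python
import numpy as np


def tolerance_and_corr_max(X: np.ndarray):
    """Return (tolerance, corr_max) arrays, one entry per column of X."""
    corr = np.corrcoef(X, rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    tolerance = 1.0 / vif                 # low tolerance means high multicollinearity

    abs_corr = np.abs(corr)
    np.fill_diagonal(abs_corr, 0.0)       # ignore each variable's correlation with itself
    corr_max = abs_corr.max(axis=1)       # strongest pairwise link per variable
    return tolerance, corr_max


rng = np.random.default_rng(1)
a = rng.normal(size=150)
b = a + 0.5 * rng.normal(size=150)        # correlated with a
c = rng.normal(size=150)                  # unrelated to a and b

tolerance, corr_max = tolerance_and_corr_max(np.column_stack([a, b, c]))
print(np.round(tolerance, 3))
print(np.round(corr_max, 3))
```

With exactly two regressors, tolerance reduces to 1 - r², where r is their pairwise correlation. With more regressors, a variable can have a low tolerance even when no single pairwise correlation is large, which is why both metrics are reported.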

3. Transfer Entropy

Purpose: Detect directional information flow between media channels (X) and the target variable (Y).

Why it matters: Transfer entropy provides a non-linear, model-free measure of predictive relationships, complementing traditional correlation analysis.

What is computed:

TE(X→Y): Information flow from channel X to target Y
TE(Y→X): Information flow from target Y to channel X
p-values: Statistical significance via permutation test (200 permutations by default)
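
To make these quantities concrete, the following standard-library sketch estimates transfer entropy on discretised series and attaches a permutation p-value. It is an illustration under stated assumptions, not the module's code: this guide only mentions binning and a permutation test, so the sketch assumes a plug-in (histogram) estimator, equal-width bins, a history length of one step, and a p-value of (exceedances + 1) / (permutations + 1). All function names are hypothetical.

```python
import math
import random
from collections import Counter


def discretise(values, bins=8):
    """Map values to equal-width bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                                   # constant series: a single bin
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def transfer_entropy(source, target, bins=8):
    """Plug-in estimate of TE(source -> target) in bits, history length 1."""
    s, t = discretise(source, bins), discretise(target, bins)
    triples = list(zip(t[1:], t[:-1], s[:-1]))     # (y_next, y_now, x_now)
    n = len(triples)
    c_full = Counter(triples)
    c_next_now = Counter((yn, y) for yn, y, _ in triples)
    c_now_src = Counter((y, x) for _, y, x in triples)
    c_now = Counter(y for _, y, _ in triples)
    te = 0.0
    for (yn, y, x), count in c_full.items():
        p_given_both = count / c_now_src[(y, x)]       # p(y_next | y_now, x_now)
        p_given_self = c_next_now[(yn, y)] / c_now[y]  # p(y_next | y_now)
        te += (count / n) * math.log2(p_given_both / p_given_self)
    return te


def permutation_pvalue(source, target, permutations=200, bins=8, seed=0):
    """Return (observed TE, p-value) by shuffling the source series."""
    rng = random.Random(seed)
    observed = transfer_entropy(source, target, bins)
    shuffled = list(source)
    exceed = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)                      # destroys the temporal alignment
        if transfer_entropy(shuffled, target, bins) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (permutations + 1)


# Synthetic example: y is x delayed by one step, so information flows x -> y
rng = random.Random(42)
x = [rng.random() for _ in range(400)]
y = [x[-1]] + x[:-1]

te_xy, p_xy = permutation_pvalue(x, y, permutations=50, bins=4)
te_yx = transfer_entropy(y, x, bins=4)
print(f"TE(X->Y)={te_xy:.3f} bits (p={p_xy:.3f}), TE(Y->X)={te_yx:.3f} bits")
```

Plug-in estimates are biased upwards on short series, so even unrelated series produce a small positive TE. The permutation test compensates by comparing the observed value against values obtained after the shuffle.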

Direction classification:

Condition	Direction	Interpretation
TE(X→Y) significant AND TE(X→Y) > TE(Y→X)	x→y	X likely predicts Y
TE(Y→X) significant AND TE(Y→X) > TE(X→Y)	y→x	Y likely predicts X (possible reverse causality)
Both significant	bidirectional	Mutual predictive relationship
Neither significant	none	No strong directional relationship
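
The classification can be sketched as below. The rows of the table overlap when both directions are significant, so this sketch assumes that "bidirectional" takes precedence and that any remaining case falls back to "none". Both choices are assumptions, and classify_direction is a hypothetical name rather than the module's function.

```python
def classify_direction(te_x_to_y, te_y_to_x, p_x_to_y, p_y_to_x, alpha=0.05):
    """Return one of 'bidirectional', 'x→y', 'y→x' or 'none'."""
    sig_xy = p_x_to_y < alpha
    sig_yx = p_y_to_x < alpha

    if sig_xy and sig_yx:                       # assumed precedence: checked first
        return "bidirectional"
    if sig_xy and te_x_to_y > te_y_to_x:
        return "x→y"
    if sig_yx and te_y_to_x > te_x_to_y:
        return "y→x"
    return "none"                               # assumed fallback for all other cases


print(classify_direction(0.20, 0.05, 0.01, 0.40))  # x→y
print(classify_direction(0.05, 0.20, 0.40, 0.01))  # y→x
print(classify_direction(0.20, 0.18, 0.01, 0.02))  # bidirectional
print(classify_direction(0.02, 0.03, 0.30, 0.50))  # none
```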

Important Caveats:

⚠️ This implementation uses pairwise (unconditional) transfer entropy

Does NOT control for confounding variables
Cannot establish true causality
May detect spurious relationships due to common drivers

⚠️ Interpretation guidance:

Use TE as an exploratory tool, not confirmatory evidence
Significant TE(X→Y) suggests X may have predictive value for Y
Always combine with domain knowledge and theoretical understanding
For rigorous causal analysis, consider conditional TE or structural models

Optional: To include control variables in the TE analysis, set te_include_controls_in_x=True when calling run_all_pre_diagnostics.

Output Files

All diagnostics save results to results/csv/.

For the complete specification of each CSV (column names and meanings), see the Reference Output Files page:

{ref}`stationarity_summary.csv <stationarity_summarycsv>`
{ref}`vif_summary.csv <vif_summarycsv>`
{ref}`transfer_entropy_summary.csv <transfer_entropy_summarycsv>`

1. `stationarity_summary.csv`

Column	Description
`variable`	Variable name (target column)
`adf_stat`	ADF test statistic
`adf_pvalue`	ADF p-value
`adf_usedlag`	Number of lags used in ADF test
`adf_nobs`	Number of observations used
`kpss_stat`	KPSS test statistic
`kpss_pvalue`	KPSS p-value
`kpss_lags`	Number of lags used in KPSS test
`adf_stationary`	Boolean: ADF rejects unit root (p < 0.05)
`kpss_nonstationary`	Boolean: KPSS rejects stationarity (p < 0.05)
`stationarity_conclusion`	Combined interpretation

See reference: {ref}`stationarity_summary.csv <stationarity_summarycsv>`

2. `vif_summary.csv`

Column	Description
`variable`	Variable name
`vif`	Variance Inflation Factor
`tolerance`	1 / VIF
`corr_max`	Maximum absolute pairwise correlation
`flag_high_vif`	Boolean: VIF > 10

See reference: {ref}`vif_summary.csv <vif_summarycsv>`

3. `transfer_entropy_summary.csv`

Column	Description
`variable`	Predictor variable name
`te_x_to_y`	Transfer entropy from X to Y
`te_y_to_x`	Transfer entropy from Y to X
`p_x_to_y`	p-value for X→Y
`p_y_to_x`	p-value for Y→X
`significant_x_to_y`	Boolean: p_x_to_y < 0.05
`significant_y_to_x`	Boolean: p_y_to_x < 0.05
`direction`	Directional classification

See reference: {ref}`transfer_entropy_summary.csv <transfer_entropy_summarycsv>`

Quick read example

import pandas as pd

stationarity = pd.read_csv('results/csv/stationarity_summary.csv')
vif = pd.read_csv('results/csv/vif_summary.csv')
te = pd.read_csv('results/csv/transfer_entropy_summary.csv')

print(stationarity.head())
print(vif.sort_values('vif', ascending=False).head())
print(te.head())

Integration

Automatic Execution

Pre-diagnostics run automatically when you execute:

python runme.py

The diagnostics execute during the DATA EXPLORATION phase, after media spend visualisations and before model fitting.

Programmatic Usage

You can also run diagnostics independently:

from src.diagnostics.pre_diagnostics import run_all_pre_diagnostics
import pandas as pd

# Load your data and config
data = pd.read_csv('your_data.csv')
config = {
    'date_col': 'date',
    'target_col': 'sales',
    'media': [
        {'display_name': 'TV', 'spend_col': 'tv_spend'},
        {'display_name': 'Digital', 'spend_col': 'digital_spend'}
    ],
    'extra_features_cols': ['price', 'competitor_activity']
}

# Run all diagnostics
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results'
)

# Print saved file paths
for filename, path in result_paths.items():
    print(f"{filename}: {path}")

Individual Tests

You can run tests individually for more control:

from src.diagnostics.pre_diagnostics import (
    run_stationarity_tests,
    run_vif_tests,
    run_transfer_entropy
)

# Assumes `data` is the DataFrame loaded in the Programmatic Usage example

# Stationarity test on target only
stationarity_df = run_stationarity_tests(
    data=data,
    date_col='date',
    cols=['sales']
)

# VIF test on regressors
vif_df = run_vif_tests(
    data=data,
    cols=['tv_spend', 'digital_spend', 'price']
)

# Transfer entropy
te_df = run_transfer_entropy(
    data=data,
    date_col='date',
    x_cols=['tv_spend', 'digital_spend'],
    y_col='sales',
    permutations=200  # Configurable
)

Advanced Configuration

# Assumes `data` and `config` are defined as in the Programmatic Usage example

# Include controls in transfer entropy analysis
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results',
    te_include_controls_in_x=True,  # Test controls → target
    te_kwargs={'permutations': 500, 'bins': 10}  # Custom TE settings
)

# Custom stationarity test settings
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results',
    stationarity_kwargs={
        'adf_regression': 'ct',  # Include trend in ADF
        'kpss_regression': 'ct'  # Include trend in KPSS
    }
)

Error Handling

The pre-diagnostics module is designed to be non-fatal:

If a diagnostic fails, it writes error information to the CSV
The pipeline continues with model fitting
Warnings are logged for non-critical issues (e.g., constant series, insufficient data)
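
The non-fatal behaviour described above follows a common pattern: catch the failure, log it, and return a record describing the error so the pipeline can continue. The sketch below illustrates that pattern only. The helper run_non_fatal and the "diagnostic" and "error" keys are hypothetical and do not reflect the actual columns the module writes on failure.

```python
import logging

logger = logging.getLogger("pre_diagnostics_sketch")


def run_non_fatal(name, func, *args, **kwargs):
    """Run one diagnostic; on failure, log a warning and return an error row."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:  # deliberately broad: a diagnostic must not stop the pipeline
        logger.warning("Diagnostic %s failed: %s", name, exc)
        return [{"diagnostic": name, "error": f"{type(exc).__name__}: {exc}"}]


def failing_diagnostic():
    """Stand-in for a diagnostic that cannot run on the given data."""
    raise ValueError("series is constant")


rows = run_non_fatal("stationarity", failing_diagnostic)
print(rows)  # an error row is returned instead of an exception being raised
```

When reading the output CSVs, it is therefore worth checking for error information before interpreting the statistics.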

Performance Considerations

Transfer Entropy is the most computationally expensive test
Default settings: 200 permutations, 8 bins
For large datasets or many channels, consider:
- Reducing permutations (at least 50 for exploratory analysis)
- Running TE separately on a subset of channels
- Using fewer bins for discretisation
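
One reason not to reduce permutations too far is that a permutation test can only resolve p-values down to a floor set by the number of permutations. The sketch below assumes the common estimator p = (exceedances + 1) / (permutations + 1). This guide does not state which estimator the module uses, so treat the numbers as indicative.

```python
def min_attainable_pvalue(permutations: int) -> float:
    """Smallest p-value a permutation test can report under the assumed estimator."""
    return 1.0 / (permutations + 1)


for n in (19, 20, 50, 200, 500):
    p_min = min_attainable_pvalue(n)
    print(f"{n:>3} permutations: minimum p = {p_min:.4f}, can reach p < 0.05: {p_min < 0.05}")
```

Under this assumption, at least 20 permutations are needed before any result can fall below the 0.05 threshold, and 50 permutations give a floor of roughly 0.02.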

Typical runtime for demo data (~80 weeks, 7 channels):

Stationarity: <1 second
VIF: <1 second
Transfer Entropy: 10-30 seconds

Best Practices

Always review stationarity results: Non-stationary targets can invalidate regression assumptions
Flag high VIF early: Multicollinearity issues are easier to address before model fitting
Use TE as an exploratory tool: Complement with domain knowledge and economic theory
Document findings: Keep notes on which diagnostics flagged issues and how you addressed them
Iterate: Run diagnostics again after data transformations or feature engineering

References

Stationarity: Dickey & Fuller (1979); Kwiatkowski et al. (1992)
VIF: Marquardt (1970); O’Brien (2007)
Transfer Entropy: Schreiber (2000); Bossomaier et al. (2016)

Limitations and Future Extensions

Current limitations:

TE is pairwise (unconditional) only
No automatic remediation suggestions
Fixed significance threshold (α = 0.05)

Planned extensions:

Conditional transfer entropy (control for confounders)
Multivariate TE
Automated data transformation recommendations
Time-varying diagnostics (rolling window analysis)

For questions or issues, please consult the main AMMM documentation or raise an issue on GitHub.