
Input Validator (`diagnostics.input_validator`)

Utilities for validating input data before modelling. These checks help ensure data quality and prevent common issues prior to model training.

Key checks

  • check_nans(dataframe, target_col, media_cols, control_cols)

    • Verifies that specified columns contain no NaN values.
    • Raises ValueError with a list of offending columns.
  • check_duplicate_columns(dataframe)

    • Ensures column names are unique.
    • Raises ValueError if duplicates are found.
  • check_date_column(date_series, config)

    • Validates chronology, frequency, missing dates, and weekly start day.
    • Parses the dates with the format given by config.get('date_format'), if one is provided.
    • Raises ValueError if the dates are unsorted, irregularly spaced, or contain gaps.
  • check_column_variance(dataframe, columns, check_zeros_only=False)

    • Detects columns with zero variance (or all zeros when check_zeros_only=True).
    • Raises ValueError listing columns with issues.
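
As a rough illustration of what the column-level checks listed above do, the sketch below reimplements them with plain pandas. The function names (sketch_check_nans, sketch_check_duplicate_columns, sketch_check_column_variance) are hypothetical stand-ins and not the module's actual code; the real validators may differ in messages, printed output, and edge-case handling. The sketch assumes duplicates are checked first, since selecting a duplicated column name returns a DataFrame rather than a Series.

```python
import pandas as pd


def sketch_check_duplicate_columns(dataframe):
    """Hypothetical stand-in: raise ValueError if a column name repeats."""
    dupes = dataframe.columns[dataframe.columns.duplicated()].unique().tolist()
    if dupes:
        raise ValueError(f"Duplicate column names: {dupes}")


def sketch_check_nans(dataframe, target_col, media_cols, control_cols):
    """Hypothetical stand-in: raise ValueError naming columns that contain NaN."""
    cols = [target_col, *media_cols, *control_cols]
    offending = [c for c in cols if dataframe[c].isna().any()]
    if offending:
        raise ValueError(f"NaN values found in columns: {offending}")


def sketch_check_column_variance(dataframe, columns, check_zeros_only=False):
    """Hypothetical stand-in: flag constant columns, or only all-zero ones."""
    if check_zeros_only:
        bad = [c for c in columns if (dataframe[c] == 0).all()]
    else:
        # A column with a single distinct value has zero variance.
        bad = [c for c in columns if dataframe[c].nunique(dropna=False) <= 1]
    if bad:
        raise ValueError(f"Columns with no usable variation: {bad}")
```

Treating "zero variance" as "one distinct value" is a simplification chosen for the sketch; a numeric tolerance on the standard deviation would be another reasonable choice.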

Usage example

import pandas as pd

from src.diagnostics.input_validator import (
    check_nans,
    check_duplicate_columns,
    check_date_column,
    check_column_variance,
)

# Example inputs
# Placeholder path: replace with your own dataset. It must contain the
# "date" and "revenue" columns plus the media and control columns below.
df = pd.read_csv("data.csv")
config = {"date_format": None}
media_cols = ["tv_spend", "search_spend"]
control_cols = ["price", "competitor_index"]

# 1) Duplicate columns
check_duplicate_columns(df)

# 2) Date column integrity
check_date_column(df["date"], config)

# 3) NaNs across core columns
check_nans(df, target_col="revenue", media_cols=media_cols, control_cols=control_cols)

# 4) Zero-variance checks
check_column_variance(df, columns=media_cols + control_cols, check_zeros_only=False)
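
To make the conditions behind the date check (step 2) more concrete, the following sketch applies comparable checks for ordering, regular spacing, and gaps using plain pandas. It is only an illustration: sketch_check_dates and its expected_freq argument are hypothetical, and neither the signature nor the logic is taken from check_date_column. The default "W-MON" assumes weekly data starting on Mondays.

```python
import pandas as pd


def sketch_check_dates(date_series, date_format=None, expected_freq="W-MON"):
    """Hypothetical stand-in: validate ordering and regular spacing of dates."""
    dates = pd.to_datetime(date_series, format=date_format)

    if not dates.is_monotonic_increasing:
        raise ValueError("Dates are not sorted in ascending order.")

    # Build the calendar we would expect between the first and last date.
    expected = pd.date_range(dates.iloc[0], dates.iloc[-1], freq=expected_freq)

    missing = expected.difference(pd.DatetimeIndex(dates))
    if len(missing) > 0:
        raise ValueError(f"Missing dates: {list(missing.strftime('%Y-%m-%d'))}")

    # Same span and no gaps, yet a different count: extra or repeated dates.
    if len(dates) != len(expected):
        raise ValueError("Dates do not follow the expected frequency.")


# Four consecutive Mondays pass silently.
weekly = pd.Series(pd.date_range("2024-01-01", periods=4, freq="W-MON"))
sketch_check_dates(weekly)
```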

Notes

  • Each validator prints structured status messages. A failed check raises ValueError so that the pipeline fails fast, before model training starts.
  • For MMM-specific diagnostics (stationarity, VIF, transfer entropy), see the Pre‑Diagnostics Guide.
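
Because failures surface as ValueError, a caller can either let the first exception stop the run or, during exploratory work, collect every failure before deciding what to fix. The helper below is a minimal sketch of that pattern; run_checks is a hypothetical name and not part of the module. It takes zero-argument callables so that it works with any validator regardless of its parameters.

```python
from typing import Callable, Iterable, List


def run_checks(
    checks: Iterable[Callable[[], None]], fail_fast: bool = True
) -> List[str]:
    """Run validators in order and return the collected error messages.

    With fail_fast=True the first ValueError propagates unchanged, so the
    returned list is only ever non-empty when fail_fast=False.
    """
    errors: List[str] = []
    for check in checks:
        try:
            check()
        except ValueError as exc:
            if fail_fast:
                raise
            errors.append(str(exc))
    return errors
```

Under these assumptions, a call such as run_checks([lambda: check_duplicate_columns(df), lambda: check_date_column(df["date"], config)], fail_fast=False) would return one message per failed check instead of stopping at the first.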