Input Validator (`diagnostics.input_validator`)
Utilities for validating input data before modelling. These checks help ensure data quality and prevent common issues prior to model training.
Key checks
- `check_nans(dataframe, target_col, media_cols, control_cols)`
  - Verifies that the specified columns contain no NaN values.
  - Raises `ValueError` with a list of offending columns.
- `check_duplicate_columns(dataframe)`
  - Ensures column names are unique.
  - Raises `ValueError` if duplicates are found.
- `check_date_column(date_series, config)`
  - Validates chronology, frequency, missing dates, and weekly start day.
  - Attempts to parse using `config.get('date_format')` if provided.
  - Raises `ValueError` if the dates are unsorted or irregular, or if gaps are detected.
- `check_column_variance(dataframe, columns, check_zeros_only=False)`
  - Detects columns with zero variance (or all zeros when `check_zeros_only=True`).
  - Raises `ValueError` listing columns with issues.
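The column-level checks listed above (NaNs, duplicate names, zero variance) can be approximated in a few lines of pandas. The sketch below is not the module's implementation: the `sketch_*` function names, the error messages, the `run` helper, and the sample frame are all assumptions made for illustration, and the status messages that the real validators print are left out.

```python
import pandas as pd


def sketch_check_nans(dataframe, target_col, media_cols, control_cols):
    """Hypothetical stand-in: raise if any listed column contains NaN."""
    cols = [target_col, *media_cols, *control_cols]
    offending = [c for c in cols if dataframe[c].isna().any()]
    if offending:
        raise ValueError(f"NaN values found in columns: {offending}")


def sketch_check_duplicate_columns(dataframe):
    """Hypothetical stand-in: raise if a column name appears more than once."""
    mask = dataframe.columns.duplicated()
    duplicated = dataframe.columns[mask].unique().tolist()
    if duplicated:
        raise ValueError(f"Duplicate column names: {duplicated}")


def sketch_check_column_variance(dataframe, columns, check_zeros_only=False):
    """Hypothetical stand-in: raise on constant (or all-zero) columns."""
    if check_zeros_only:
        bad = [c for c in columns if (dataframe[c] == 0).all()]
    else:
        # A column with a single distinct value has zero variance.
        bad = [c for c in columns if dataframe[c].nunique(dropna=False) <= 1]
    if bad:
        raise ValueError(f"Columns with no usable variation: {bad}")


def run(check, *args, **kwargs):
    """Illustrative helper: return the error message, or None if the check passes."""
    try:
        check(*args, **kwargs)
        return None
    except ValueError as exc:
        return str(exc)


# Made-up sample data, chosen so that each check has something to report.
df = pd.DataFrame({
    "revenue": [100.0, 120.0, None],
    "tv_spend": [10.0, 10.0, 10.0],
    "search_spend": [0.0, 0.0, 0.0],
    "price": [1.0, 1.1, 1.2],
})
features = ["tv_spend", "search_spend", "price"]

dup_msg = run(sketch_check_duplicate_columns, df)
nan_msg = run(sketch_check_nans, df, "revenue", ["tv_spend", "search_spend"], ["price"])
var_msg = run(sketch_check_column_variance, df, features)
zero_msg = run(sketch_check_column_variance, df, features, check_zeros_only=True)
```

In this sketch the duplicate-name check is worth running first: when a name is duplicated, `dataframe[c]` returns a DataFrame rather than a Series, and the per-column tests no longer behave as intended.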
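The date check can be sketched in the same spirit. This is a simplified illustration rather than the module's logic: it assumes weekly data (a fixed 7-day step), covers only parsing, sort order, and gaps, and omits the weekly start-day validation. The function name `sketch_check_date_column` and the error messages are hypothetical.

```python
import pandas as pd


def sketch_check_date_column(date_series, config):
    """Hypothetical stand-in for a weekly date-column check."""
    # Parse with the configured format when one is given; otherwise let pandas infer it.
    dates = pd.to_datetime(date_series, format=config.get("date_format"))

    # Chronology: rows must already be in ascending date order.
    if not dates.is_monotonic_increasing:
        raise ValueError("Dates are not sorted in ascending order.")

    # Assumption: weekly data, so consecutive rows should be exactly 7 days apart.
    steps = dates.diff().dropna()
    if not (steps == pd.Timedelta(days=7)).all():
        raise ValueError("Dates are irregular or contain gaps (expected 7-day steps).")


# A regular weekly series passes silently.
weekly = pd.Series(["2024-01-01", "2024-01-08", "2024-01-15"])
sketch_check_date_column(weekly, {"date_format": "%Y-%m-%d"})

# A series with a missing week raises.
gappy = pd.Series(["2024-01-01", "2024-01-08", "2024-01-22"])
try:
    sketch_check_date_column(gappy, {"date_format": None})
    gap_error = None
except ValueError as exc:
    gap_error = str(exc)
```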
Usage example

    import pandas as pd

    from src.diagnostics.input_validator import (
        check_nans,
        check_duplicate_columns,
        check_date_column,
        check_column_variance,
    )

    # Example inputs
    # `df` is assumed to be an already-loaded pandas DataFrame containing
    # "date", "revenue", and the media/control columns listed below.
    config = {"date_format": None}
    media_cols = ["tv_spend", "search_spend"]
    control_cols = ["price", "competitor_index"]

    # 1) Duplicate columns
    check_duplicate_columns(df)

    # 2) Date column integrity
    check_date_column(df["date"], config)

    # 3) NaNs across core columns
    check_nans(df, target_col="revenue", media_cols=media_cols, control_cols=control_cols)

    # 4) Zero-variance checks
    check_column_variance(df, columns=media_cols + control_cols, check_zeros_only=False)

Notes
- These validators print structured status messages; errors raise `ValueError` and are intended to fail fast.
- For MMM-specific diagnostics (stationarity, VIF, transfer entropy), see the Pre‑Diagnostics Guide.
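One way to lean on the fail-fast behaviour is to run the validators in a fixed order at the start of a pipeline and stop at the first `ValueError`. The sketch below illustrates that pattern only: `validate_or_abort`, `no_duplicate_columns`, and `no_nans` are hypothetical names, and the two stand-in validators take the place of the real functions so that the example runs on its own.

```python
import pandas as pd


def validate_or_abort(df, checks):
    """Run each (name, callable) pair in order; stop at the first ValueError."""
    for name, check in checks:
        try:
            check(df)
        except ValueError as exc:
            # Fail fast: re-raise with the name of the failing step attached.
            raise ValueError(f"Input validation failed at '{name}': {exc}") from exc
    return df


# Stand-in validators (hypothetical). In practice these would be the module's
# functions, wrapped so that each takes only the DataFrame.
def no_duplicate_columns(df):
    if df.columns.duplicated().any():
        raise ValueError("duplicate column names")


def no_nans(df):
    if df.isna().any().any():
        raise ValueError("NaN values present")


# Made-up data with one missing revenue value.
df = pd.DataFrame({"revenue": [1.0, None], "tv_spend": [5.0, 6.0]})
try:
    validate_or_abort(df, [("duplicates", no_duplicate_columns), ("nans", no_nans)])
    failure = None
except ValueError as exc:
    failure = str(exc)
```

Attaching the step name to the message keeps the original error text while showing which stage of validation stopped the run.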