
Input Validator (`src.diagnostics.input_validator`)

`input_validator.py` provides fail-fast data integrity checks that run before staged diagnostics and model fitting. It does not write artefacts; it raises `ValueError` when a critical input issue is detected.

    from src.diagnostics.input_validator import (
        check_nans,
        check_duplicate_columns,
        check_date_column,
        check_column_variance,
    )

    check_nans(
        dataframe: pd.DataFrame,
        target_col: str,
        media_cols: list[str],
        control_cols: list[str],
    ) -> None

    check_duplicate_columns(
        dataframe: pd.DataFrame,
    ) -> None

    check_date_column(
        date_series: pd.Series,
        config: dict[str, Any],
    ) -> None

    check_column_variance(
        dataframe: pd.DataFrame,
        columns: list[str],
        check_zeros_only: bool = False,
    ) -> None

| Function | Key parameters |
| --- | --- |
| `check_nans` | `target_col`, `media_cols`, `control_cols` define which columns must be non-null. |
| `check_duplicate_columns` | Checks for duplicate column names in `dataframe.columns`. |
| `check_date_column` | Validates date parseability, sort order, inferred frequency, missing dates, and weekly start-day consistency. Reads `config.get("date_format")` if needed. |
| `check_column_variance` | Checks for constant or all-zero columns among `columns`. If `check_zeros_only=True`, only all-zero columns are flagged. |
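
To make the parameter semantics above concrete, here is a minimal sketch of how checks of this kind could behave. The functions `sketch_check_nans` and `sketch_check_column_variance` are hypothetical stand-ins written for illustration only; they are not the module's actual implementation, and the error messages are invented.

```python
import pandas as pd


def sketch_check_nans(dataframe, target_col, media_cols, control_cols):
    """Hypothetical stand-in: raise ValueError if any required column has NaNs."""
    required = [target_col, *media_cols, *control_cols]
    nan_counts = dataframe[required].isna().sum()
    offenders = nan_counts[nan_counts > 0]
    if not offenders.empty:
        raise ValueError(f"NaN values in required columns: {list(offenders.index)}")


def sketch_check_column_variance(dataframe, columns, check_zeros_only=False):
    """Hypothetical stand-in: flag all-zero columns, plus constant columns
    unless check_zeros_only=True."""
    flagged = []
    for col in columns:
        series = dataframe[col]
        all_zero = bool((series == 0).all())
        constant = series.nunique(dropna=False) <= 1
        if all_zero or (constant and not check_zeros_only):
            flagged.append(col)
    if flagged:
        raise ValueError(f"Non-informative columns: {flagged}")


# Illustrative data: tv_spend is all zeros, price_index is constant but non-zero.
demo = pd.DataFrame(
    {
        "revenue": [10.0, 12.0, 11.0],
        "tv_spend": [0.0, 0.0, 0.0],
        "price_index": [1.0, 1.0, 1.0],
    }
)

# No NaNs in any required column, so this returns None.
sketch_check_nans(demo, "revenue", ["tv_spend"], ["price_index"])

# With check_zeros_only=True a constant, non-zero column is not flagged.
sketch_check_column_variance(demo, ["price_index"], check_zeros_only=True)
```

In this sketch, calling `sketch_check_column_variance(demo, ["price_index"])` with the default `check_zeros_only=False` would raise `ValueError`, because a constant column is treated as non-informative in that mode.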

This module does not produce files and does not target a stage folder.

| Output | Stage folder | Description |
| --- | --- | --- |
| None | N/A | Validation runs in-memory and raises exceptions on failure. |

| Check | Failure meaning | Typical action |
| --- | --- | --- |
| NaN check | Missing values in required model columns | Impute, drop, or fix upstream extract/joins before fitting. |
| Duplicate columns | Ambiguous feature references | Deduplicate headers before preprocessing. |
| Date validation | Irregular or unsorted time index | Correct sort order, parsing, and frequency gaps. |
| Variance check | Constant/all-zero regressors | Remove or repair non-informative predictors. |

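
Some of the typical actions above can be sketched with plain pandas. The frame `raw` and its column names are made up for illustration, and the right remedy for a given dataset (for example dropping versus repairing a constant column) depends on the modelling context.

```python
import pandas as pd

# Illustrative input: a duplicated header, unsorted dates, and a constant column.
raw = pd.DataFrame(
    [
        ["2024-01-15", 5.0, 5.0, 1.0],
        ["2024-01-01", 3.0, 3.0, 1.0],
        ["2024-01-08", 4.0, 4.0, 1.0],
    ],
    columns=["DATE", "tv_spend", "tv_spend", "flat_control"],
)

# Duplicate columns: keep the first occurrence of each header.
deduped = raw.loc[:, ~raw.columns.duplicated()]

# Date validation: parse with an explicit format, then sort ascending.
deduped = deduped.assign(DATE=pd.to_datetime(deduped["DATE"], format="%Y-%m-%d"))
cleaned = deduped.sort_values("DATE").reset_index(drop=True)

# Variance check: drop regressors that carry a single value.
candidate_cols = ["tv_spend", "flat_control"]
constant_cols = [c for c in candidate_cols if cleaned[c].nunique(dropna=False) <= 1]
cleaned = cleaned.drop(columns=constant_cols)
```

After these steps `cleaned` holds one `tv_spend` column, dates in ascending order, and no constant regressors, so the corresponding checks would be expected to pass on it.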
    import pandas as pd

    from src.diagnostics.input_validator import (
        check_nans,
        check_duplicate_columns,
        check_date_column,
        check_column_variance,
    )

    # `df` is assumed to be an already-loaded pandas DataFrame that contains
    # the DATE, revenue, media, and control columns referenced below.
    config = {"date_format": "%Y-%m-%d"}
    media_cols = ["tv_spend", "search_spend"]
    control_cols = ["price_index", "competitor_sales"]

    # Each check returns None on success and raises ValueError on failure.
    check_duplicate_columns(df)
    check_date_column(df["DATE"], config)
    check_nans(df, target_col="revenue", media_cols=media_cols, control_cols=control_cols)
    check_column_variance(df, columns=media_cols + control_cols, check_zeros_only=False)

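
The date check covers several conditions at once, so a rough sketch of how parseability, sort order, and frequency gaps might be tested with pandas may help. `sketch_check_date_column` is a hypothetical stand-in, not the module's code; it relies on `pd.infer_freq` returning `None` for an irregular index and does not attempt a separate weekly start-day test.

```python
import pandas as pd


def sketch_check_date_column(date_series, config):
    """Hypothetical stand-in for a fail-fast date check."""
    # Parseability: use the configured format when one is supplied.
    try:
        parsed = pd.to_datetime(date_series, format=config.get("date_format"))
    except (ValueError, TypeError) as exc:
        raise ValueError(f"Date column could not be parsed: {exc}") from exc

    # Sort order: dates must be ascending.
    if not parsed.is_monotonic_increasing:
        raise ValueError("Date column is not sorted in ascending order.")

    # Frequency and gaps: infer_freq returns None when spacing is irregular.
    # Note that infer_freq itself raises ValueError for fewer than 3 dates.
    if pd.infer_freq(pd.DatetimeIndex(parsed)) is None:
        raise ValueError("No regular frequency inferred; check for missing dates.")


# A regular weekly series passes without raising.
weekly = pd.Series(["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22"])
sketch_check_date_column(weekly, {"date_format": "%Y-%m-%d"})
```

A series with a skipped week or with rows out of order would raise `ValueError` in this sketch, which mirrors the exception-based behaviour described for the real validator.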
  • This validator runs before stage-folder artefacts are produced.
  • It acts as an entry condition for the staged workflow and must pass before 10_pre_diagnostics/ and later gate checks (g1 to g6) are meaningful.
  • Pass/fail behaviour is exception-based (ValueError) rather than report-based.
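
The entry-condition role described in these notes can be sketched as a small runner that stops at the first `ValueError`, so no later stage executes on invalid input. `run_entry_checks`, `run_workflow`, and the stand-in checks are hypothetical names for illustration; in practice the callables would wrap the real `check_*` functions.

```python
from typing import Callable

# A named check: (label, zero-argument callable that raises ValueError on failure).
Check = tuple[str, Callable[[], None]]


def run_entry_checks(checks: list[Check]) -> None:
    """Run checks in order; re-raise the first failure with its label attached."""
    for name, check in checks:
        try:
            check()
        except ValueError as exc:
            raise ValueError(f"Input validation failed at '{name}': {exc}") from exc


def run_workflow(checks: list[Check], stages: list[Callable[[], None]]) -> None:
    """Only start the staged workflow once every entry check has passed."""
    run_entry_checks(checks)
    for stage in stages:
        stage()


def passing_check() -> None:
    # Stand-in for a validator call that succeeds.
    return None


def failing_check() -> None:
    # Stand-in for a validator call that detects a critical issue.
    raise ValueError("column 'tv_spend' is all zeros")


executed: list[str] = []
failure_message = ""
try:
    run_workflow(
        checks=[("duplicates", passing_check), ("variance", failing_check)],
        stages=[lambda: executed.append("10_pre_diagnostics")],
    )
except ValueError as exc:
    failure_message = str(exc)
```

Because the second check raises, `executed` stays empty: the stage never runs, and the caller sees a single exception rather than a report file.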