Forecast Comparison Tests¶
Tests for comparing predictive accuracy of forecasting models. Essential for model selection and evaluation in time series analysis.
Validation: All forecast comparison tests are validated against R implementations (forecast package, MCS package).
diebold_mariano¶
Diebold-Mariano test for equal predictive accuracy between two forecasts.
The standard test for comparing out-of-sample forecast accuracy. Tests whether two forecasting models have equal expected loss.
ps.diebold_mariano(
errors1: Union[pl.Expr, str],
errors2: Union[pl.Expr, str],
loss: str = "squared", # "squared", "absolute"
horizon: int = 1, # Forecast horizon
) -> pl.Expr
Returns: Struct{statistic: Float64, p_value: Float64}
Null hypothesis: The two forecasts have equal expected loss (E[d_t] = 0 where d_t = L(e1_t) - L(e2_t)).
Parameters:
- loss="squared": Mean squared error (MSE) - penalizes large errors more
- loss="absolute": Mean absolute error (MAE) - more robust to outliers
- horizon: Forecast horizon h; accounts for autocorrelation in h-step ahead forecast errors
When to use:
- Comparing two competing forecasting models
- Model selection for time series prediction
- Evaluating whether a complex model beats a simple benchmark
Example:
# Compare ARIMA vs neural network forecasts
df.select(ps.diebold_mariano("arima_errors", "nn_errors", horizon=1))
# Multi-step ahead comparison with MAE loss
df.select(ps.diebold_mariano("model1_errors", "model2_errors",
loss="absolute", horizon=4))
Reference: Diebold, F.X. & Mariano, R.S. (1995). "Comparing Predictive Accuracy", Journal of Business & Economic Statistics.
permutation_t_test¶
Permutation-based t-test for comparing two samples (non-parametric).
Uses permutation resampling to compute exact p-values without distributional assumptions, making it valid for any sample size.
ps.permutation_t_test(
x: Union[pl.Expr, str],
y: Union[pl.Expr, str],
alternative: str = "two-sided",
n_permutations: int = 999,
seed: int | None = None,
) -> pl.Expr
Returns: Struct{statistic: Float64, p_value: Float64}
Null hypothesis: The two samples come from the same distribution.
When to use:
- Small samples where t-test assumptions may not hold
- Non-normal data
- When exact p-values are required
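The permutation scheme pools the two samples, repeatedly reshuffles the pooled values into two groups of the original sizes, and recomputes the t-statistic; the p-value is the proportion of reshuffled statistics at least as extreme as the observed one. Example (column names are illustrative):
# Compare losses of two models on a small sample
df.select(ps.permutation_t_test("model1_losses", "model2_losses",
                                n_permutations=9999, seed=42))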
clark_west¶
Clark-West test for comparing nested forecasting models.
Adjusts the Diebold-Mariano test for the case when one model nests another (e.g., comparing AR(1) vs AR(1) + X). The standard DM test is undersized for nested models.
ps.clark_west(
restricted_errors: Union[pl.Expr, str],
unrestricted_errors: Union[pl.Expr, str],
horizon: int = 1,
) -> pl.Expr
Returns: Struct{statistic: Float64, p_value: Float64}
Null hypothesis: The restricted (simpler) model forecasts as well as the unrestricted model.
When to use:
- Comparing a benchmark model to an augmented version
- Testing if additional predictors improve forecasts
- Nested model comparison (e.g., random walk vs model with fundamentals)
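Example (column names are illustrative):
# Does adding fundamentals improve on the random walk benchmark?
df.select(ps.clark_west("rw_errors", "fundamentals_errors", horizon=1))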
Reference: Clark, T.E. & West, K.D. (2007). "Approximately Normal Tests for Equal Predictive Accuracy in Nested Models", Journal of Econometrics.
spa_test¶
Superior Predictive Ability (SPA) test by Hansen (2005).
Tests whether any model in a set significantly outperforms a benchmark, controlling for data-snooping bias when many models are compared.
ps.spa_test(
benchmark_loss: Union[pl.Expr, str],
*model_losses: Union[pl.Expr, str],
n_bootstrap: int = 999,
block_length: float = 5.0,
seed: int | None = None,
) -> pl.Expr
Returns: Struct{statistic: Float64, p_value: Float64}
Null hypothesis: No model outperforms the benchmark (max expected loss difference ≤ 0).
Parameters:
- block_length: Expected block length for the stationary bootstrap; longer blocks preserve more of the serial correlation in the losses
When to use:
- Comparing many models to a single benchmark
- Addressing data-snooping concerns in model selection
- Robust model comparison with multiple alternatives
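Example (loss column names are illustrative; each column holds per-observation losses):
# Does any candidate beat the benchmark, accounting for data snooping?
df.select(ps.spa_test("benchmark_loss",
                      "model_a_loss", "model_b_loss", "model_c_loss",
                      n_bootstrap=999, block_length=5.0, seed=42))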
Reference: Hansen, P.R. (2005). "A Test for Superior Predictive Ability", Journal of Business & Economic Statistics.
model_confidence_set¶
Model Confidence Set (MCS) for identifying the set of best-performing models.
Returns a set of models that contains the best model(s) with a given confidence level. Unlike pairwise tests, MCS considers all models simultaneously.
ps.model_confidence_set(
*model_losses: Union[pl.Expr, str],
alpha: float = 0.1,
statistic: str = "range", # "range" or "semi-quadratic"
n_bootstrap: int = 999,
block_length: float = 5.0,
seed: int | None = None,
) -> pl.Expr
Returns: Struct{included: List[Boolean], p_values: List[Float64]}
Interpretation:
- included[i] = True: Model i is in the confidence set (cannot be rejected as best)
- p_values[i]: MCS p-value for model i; the model is eliminated at significance levels above this value
Parameters:
- alpha=0.1: Significance level for model elimination
- statistic="range": Test statistic; "range" is more powerful, "semi-quadratic" more robust
When to use:
- Identifying all models that are statistically indistinguishable from the best
- Model selection when multiple good models exist
- Reporting uncertainty in model rankings
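Example (column names are illustrative):
# Which models survive at 90% confidence (alpha=0.1)?
df.select(ps.model_confidence_set("model_a_loss", "model_b_loss", "model_c_loss",
                                  alpha=0.1, seed=42))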
Reference: Hansen, P.R., Lunde, A. & Nason, J.M. (2011). "The Model Confidence Set", Econometrica.
mspe_adjusted¶
MSPE-Adjusted SPA test for comparing nested models.
Combines the Clark-West adjustment for nested models with the SPA framework for multiple comparisons.
ps.mspe_adjusted(
benchmark_errors: Union[pl.Expr, str],
*model_errors: Union[pl.Expr, str],
n_bootstrap: int = 999,
block_length: float = 5.0,
seed: int | None = None,
) -> pl.Expr
Returns: Struct{statistic: Float64, p_value: Float64}
When to use:
- Multiple nested model comparisons
- Testing if any of several augmented models beats a simple benchmark
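Example (column names are illustrative):
# Does any augmented model beat the simple benchmark?
df.select(ps.mspe_adjusted("benchmark_errors",
                           "model_x_errors", "model_xy_errors", seed=42))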
Modern Distribution Tests¶
energy_distance¶
Energy Distance test for comparing two distributions.
A powerful non-parametric test that detects differences in any aspect of the distributions (location, scale, shape). Based on the concept of statistical energy from physics.
ps.energy_distance(
x: Union[pl.Expr, str],
y: Union[pl.Expr, str],
n_permutations: int = 999,
seed: int | None = None,
) -> pl.Expr
Returns: Struct{statistic: Float64, p_value: Float64}
Null hypothesis: The two samples come from the same distribution.
Advantages:
- Detects differences in location AND scale AND shape
- No parametric assumptions
- Works well in high dimensions
When to use:
- Testing if forecast error distributions differ between models
- Detecting distribution shift (covariate shift)
- General two-sample comparison
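Example (column names are illustrative):
# Energy distance: 2*E|X-Y| - E|X-X'| - E|Y-Y'|, zero iff the distributions match
df.select(ps.energy_distance("model1_errors", "model2_errors", seed=42))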
Reference: Székely, G.J. & Rizzo, M.L. (2013). "Energy Statistics", Wiley StatsRef.
mmd_test¶
Maximum Mean Discrepancy (MMD) test with Gaussian kernel.
A kernel-based two-sample test that embeds distributions into a reproducing kernel Hilbert space. Widely used in machine learning for distribution comparison.
ps.mmd_test(
x: Union[pl.Expr, str],
y: Union[pl.Expr, str],
n_permutations: int = 999,
seed: int | None = None,
) -> pl.Expr
Returns: Struct{statistic: Float64, p_value: Float64}
Null hypothesis: The two samples come from the same distribution.
Advantages:
- Powerful against a wide range of alternatives
- Detects subtle distributional differences
- Foundation of many generative model evaluations (GANs)
When to use:
- High-dimensional distribution comparison
- Detecting subtle differences in prediction distributions
- Model validation in machine learning
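Example (column names are illustrative):
# MMD^2 = E[k(x,x')] + E[k(y,y')] - 2*E[k(x,y)] with a Gaussian kernel k
df.select(ps.mmd_test("model1_errors", "model2_errors",
                      n_permutations=999, seed=42))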
Reference: Gretton, A. et al. (2012). "A Kernel Two-Sample Test", JMLR.
Choosing a Test¶
| Situation | Recommended Test |
|---|---|
| Two non-nested forecasting models | diebold_mariano |
| Nested models (restricted vs unrestricted) | clark_west |
| Many models vs one benchmark | spa_test |
| Find all "best" models | model_confidence_set |
| Many nested models | mspe_adjusted |
| General distribution comparison | energy_distance or mmd_test |
| Non-parametric mean comparison | permutation_t_test |
See Also¶
- Parametric Tests - Standard hypothesis tests
- Non-Parametric Tests - Distribution-free alternatives