# R Validation
polars-statistics is validated against R's reference implementations at the Rust engine level. Every statistical test, regression model, and GLM is tested against exact values produced by R scripts using set.seed() for reproducibility.
## Validation Architecture

```text
polars-statistics (Python/Polars)
├── anofox-regression (Rust crate)   ← regression/GLM validation
└── anofox-statistics (Rust crate)   ← statistical tests validation
    └── R scripts with set.seed(42)
        └── CSV reference data generated by R
            └── Rust tests assert against R values
```
The R validation lives in two upstream Rust crates:
- anofox-regression — OLS, WLS, Ridge, Elastic Net, Quantile, GLMs (Poisson, Binomial, Gamma, Inverse Gaussian, Negative Binomial, Tweedie)
- anofox-statistics — t-tests, ANOVA, Mann-Whitney, Wilcoxon, Kruskal-Wallis, Brunner-Munzel, Pearson/Spearman/Kendall, chi-square, Fisher, Shapiro-Wilk, and more
Reference values are generated by R scripts (`Rscript R/generate_refs.R`, `generate_regression_validation.R`, `generate_glm_validation.R`) and stored as CSV files or Rust constants. Rust tests then assert that computed values match R within strict tolerances.
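The comparison pattern those Rust tests follow can be sketched in Python (illustrative only; the helper name `assert_close` and its signature are hypothetical, not part of either crate):

```python
def assert_close(actual: float, expected: float, tol: float, relative: bool = False) -> None:
    """Check a computed value against an R reference within a tolerance.

    Absolute tolerances (e.g. 1e-10) are typical for test statistics;
    relative tolerances (e.g. 0.01) for regression coefficients.
    """
    diff = abs(actual - expected)
    if relative:
        diff /= max(abs(expected), 1e-300)  # guard against division by zero
    assert diff < tol, f"{actual} vs {expected} (tol={tol})"

# A statistic matching R to 1e-10 absolute
assert_close(2.1213203435596424, 2.1213203435596424, 1e-10)
# A coefficient matching R to 1% relative error
assert_close(1.787, 1.7870, 0.01, relative=True)
```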
## Statistical Tests

### Parametric Tests

#### T-Tests

Validated against R's `t.test()` at tolerance 1e-10.

```python
import polars as pl
import polars_statistics as ps

df = pl.DataFrame({"g1": [...], "g2": [...]})

# Welch t-test (default)
df.select(ps.ttest_ind("g1", "g2").alias("t"))

# Paired t-test
df.select(ps.ttest_paired("g1", "g2").alias("t"))

# One-sided
df.select(ps.ttest_ind("g1", "g2", alternative="less").alias("t"))

# With location shift
df.select(ps.ttest_ind("g1", "g2", mu=0.5).alias("t"))
```
What is validated:

| Output field | R field | Tolerance |
|---|---|---|
| `statistic` | t statistic | 1e-10 |
| `df` | degrees of freedom | 1e-10 |
| `p_value` | p-value (two-sided, less, greater) | 1e-10 |
| `mean_x`, `mean_y` | sample means | 1e-10 |
| `ci_lower`, `ci_upper` | confidence intervals (90%, 95%, 99%) | 1e-6 |
#### Yuen's Trimmed Means Test

Validated against R's `WRS2::yuend()` at tolerance 1e-10.

| Output field | Tolerance |
|---|---|
| `statistic` | 1e-10 |
| `df` | 1e-10 |
| `p_value` (two-sided, less, greater) | 1e-10 |
| `estimate` (trimmed mean difference) | 1e-10 |
| `ci_lower`, `ci_upper` (95% CI) | 1e-6 |
#### Brown-Forsythe Test

Validated against R's `car::leveneTest(center = "median")` at tolerance 1e-10.

| Output field | Tolerance |
|---|---|
| `statistic` (F) | 1e-10 |
| `df1`, `df2` | 1e-10 |
| `p_value` | 1e-10 |
#### ANOVA

Validated against R's `aov()`, `oneway.test()`, and `mauchly.test()`.

```r
# One-way (Fisher, assumes equal variances)
oneway.test(y ~ group, data = df, var.equal = TRUE)

# One-way (Welch, no equal-variance assumption)
oneway.test(y ~ group, data = df, var.equal = FALSE)

# Two-way with interaction
aov(y ~ A * B, data = df)

# Repeated measures
aov(value ~ condition + Error(subject/condition), data = df)
mauchly.test(fit)  # sphericity
```
One-way ANOVA (Fisher) validated fields:

| Output field | R field | Tolerance |
|---|---|---|
| `statistic` | F | 1e-10 |
| `df_between`, `df_within` | numerator/denominator df | 1e-10 |
| `p_value` | p-value | 1e-10 |
| `ss_between`, `ss_within` | sum of squares | 1e-10 |
| `ms_between`, `ms_within` | mean squares | 1e-10 |
| `grand_mean` | overall mean | 1e-10 |
Algebraic identity verified: with 2 groups, F = t² (the one-way ANOVA F-statistic equals the squared pooled-variance Student t-statistic).
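The identity is easy to confirm by hand. A minimal pure-Python check on made-up data (pooled-variance t vs. one-way F):

```python
g1 = [1.0, 2.0, 3.0, 4.0, 5.0]
g2 = [3.0, 4.0, 5.0, 6.0, 7.0]
n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2

# Pooled-variance (Student) t-statistic
ss1 = sum((v - m1) ** 2 for v in g1)
ss2 = sum((v - m2) ** 2 for v in g2)
sp2 = (ss1 + ss2) / (n1 + n2 - 2)          # pooled variance
t = (m1 - m2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA F-statistic for the same two groups
grand = (sum(g1) + sum(g2)) / (n1 + n2)
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
ss_within = ss1 + ss2
f = (ss_between / 1) / (ss_within / (n1 + n2 - 2))

assert abs(f - t ** 2) < 1e-12             # F = t^2 with two groups
```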
Two-way ANOVA validates SS, df, MS, F, and p-value for Factor A, Factor B, and Interaction (Type I for balanced, Type III for unbalanced designs).
Repeated measures ANOVA additionally validates Mauchly's W (sphericity), Greenhouse-Geisser epsilon, and Huynh-Feldt epsilon.
### Non-Parametric Tests

#### Mann-Whitney U

Validated against R's `wilcox.test()` at tolerance 1e-10 (statistic) and 1e-6 (p-value).

| Output field | Tolerance |
|---|---|
| `statistic` (W) | 1e-10 |
| `p_value` (two-sided, less, greater) | 1e-6 |
| `p_value` with continuity correction | 1e-6 |
| `estimate` (Hodges-Lehmann, exact mode) | 1e-10 |
| `ci_lower`, `ci_upper` (exact, 90%/95%) | 1e-10 |
#### Wilcoxon Signed-Rank

Validated against R's `wilcox.test(paired = TRUE)` at tolerances 1e-10 (statistic) and 1e-6 (p-value).

#### Kruskal-Wallis

Validated against R's `kruskal.test()` at tolerances 1e-10 (statistic) and 1e-6 (p-value).

#### Brunner-Munzel

Validated against R's `brunnermunzel::brunnermunzel.test()` at tolerance 1e-6.
| Output field | Tolerance |
|---|---|
| `statistic` | 1e-6 |
| `df` | 1e-6 |
| `p_value` (two-sided, less, greater) | 1e-6 |
| `estimate` (P(X < Y) win probability) | 1e-6 |
| `ci_lower`, `ci_upper` (90%, 95%, 99%) | 1e-6 |
### Correlation

#### Pearson

Validated against R's `cor.test(method = "pearson")` at tolerance 1e-10.

| Output field | Tolerance |
|---|---|
| `estimate` (r) | 1e-10 |
| `statistic` (t) | 1e-10 |
| `df` | 1e-10 |
| `p_value` | 1e-6 |
| `ci_lower`, `ci_upper` (90%, 95%, 99%) | 1e-4 |
Special cases validated: perfect positive (r=1), perfect negative (r=-1), zero correlation.
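The r → t mapping that `cor.test()` uses for its p-value, t = r·√((n−2)/(1−r²)), can be reproduced by hand on illustrative data:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)            # Pearson correlation
t = r * math.sqrt((n - 2) / (1 - r * r))  # cor.test's t statistic, df = n - 2

print(round(r, 4), round(t, 4))  # 0.7746 2.1213
```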
#### Spearman

Validated against R's `cor.test(method = "spearman")` at tolerance 1e-10 (rho).

Special cases validated: perfect, negative, zero correlation, tied ranks.

#### Kendall

Validated against R's `cor.test(method = "kendall")` at tolerance 1e-4 (tau).
### Categorical Tests

#### Chi-Square Test

Validated against R's `chisq.test()` at tolerance 1e-10.

```r
# Test of independence (2x2)
chisq.test(matrix(c(10, 20, 30, 40), nrow = 2), correct = FALSE)

# With Yates' correction
chisq.test(matrix(c(10, 20, 30, 40), nrow = 2), correct = TRUE)

# Goodness of fit (uniform)
chisq.test(c(20, 25, 22, 18, 15))

# Goodness of fit (custom proportions)
chisq.test(c(50, 30, 20), p = c(0.5, 0.3, 0.2))
```
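For the 2x2 independence example (R's `matrix()` fills column-wise), the uncorrected statistic can be checked by hand:

```python
# matrix(c(10, 20, 30, 40), nrow = 2) in R produces:
table = [[10, 30],
         [20, 40]]

n = sum(sum(row) for row in table)
row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]

# X^2 = sum over cells of (observed - expected)^2 / expected
chi2 = sum(
    (table[i][j] - e) ** 2 / e
    for i in range(2)
    for j in range(2)
    for e in [row_sums[i] * col_sums[j] / n]  # expected count under independence
)
print(round(chi2, 4))  # 0.7937
```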
#### Fisher's Exact Test

Validated against R's `fisher.test()` at tolerance 0.01 (p-value).

#### Effect Size Measures

| Measure | R Package | Tolerance |
|---|---|---|
| Cramer's V | `rcompanion::cramerV()` | 1e-4 |
| Phi coefficient | `psych::phi()` | 1e-4 |
| Contingency coefficient | `vcd::assocstats()` | 1e-4 |
| Cohen's kappa | `irr::kappa2()` | 0.01 |
Cohen's kappa example (manually verified):

```text
Confusion matrix: [[20, 5, 0], [10, 30, 5], [0, 5, 25]]
Po = (20 + 30 + 25) / 100 = 0.75
Pe = (25×30 + 45×40 + 30×30) / 10000 = 0.345
kappa = (0.75 - 0.345) / (1 - 0.345) = 0.618
```
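The same arithmetic in Python, using the confusion matrix from the example above:

```python
m = [[20, 5, 0],
     [10, 30, 5],
     [0, 5, 25]]

n = sum(sum(row) for row in m)
po = sum(m[i][i] for i in range(3)) / n                       # observed agreement
row_sums = [sum(row) for row in m]
col_sums = [sum(col) for col in zip(*m)]
pe = sum(row_sums[i] * col_sums[i] for i in range(3)) / n**2  # chance agreement

kappa = (po - pe) / (1 - pe)
print(round(po, 3), round(pe, 3), round(kappa, 3))  # 0.75 0.345 0.618
```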
### Distributional Tests

#### Shapiro-Wilk

Validated against R's `shapiro.test()`.
| Scenario | W tolerance | p-value tolerance |
|---|---|---|
| Normal data (small/medium/large n) | 1e-3 | 0.05 |
| Non-normal (uniform, exponential) | 1e-3 | 0.05 |
| Edge cases (n=3) | 1e-3 | 0.05 |
| Edge cases (n=4, n=5) | 0.02 | directional only |
| Constant data | 1e-10 (W=1.0) | 1e-10 (p=1.0) |
**Tolerance rationale:** Shapiro-Wilk has known implementation variations across software packages. The W statistic typically matches to 3-4 decimal places between R and other implementations. polars-statistics uses the same algorithm but may differ slightly in coefficient tables.
#### D'Agostino K-Squared

Validated against R's `moments::agostino.test()`.
| Scenario | z-score tolerance | p-value |
|---|---|---|
| Normal data | 0.15 | > 0.05 (not significant) |
| Skewed data | 0.15 | < 0.05 (significant) |
## Regression

### OLS and WLS

#### Weighted Least Squares

Validated against R's `lm()` with weights at tolerance 0.01 (relative).

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 3.1055 | 0.01 |
| `coefficients[0]` | 1.7870 | 0.01 |
| `se_intercept` | 0.6773 | 0.01 |
| `se_coefficients[0]` | 0.1913 | 0.01 |
| `r_squared` | 0.8289 | 0.01 |
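For a single predictor, the WLS estimator has a closed form. A sketch on hypothetical data (the dataset behind the R values above is not reproduced here); with unit weights it collapses to ordinary least squares, a cheap sanity check:

```python
def wls_fit(x, y, w):
    """Closed-form weighted least squares for y = a + b*x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b  # (intercept, slope)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.8, 8.2, 9.9]

# Unit weights reproduce lm(y ~ x)
a, b = wls_fit(x, y, [1.0] * 5)
print(round(a, 2), round(b, 2))  # 0.21 1.95

# Up-weighting one observation shifts the fit, mirroring lm(y ~ x, weights = w)
a_w, b_w = wls_fit(x, y, [4.0, 1.0, 1.0, 1.0, 1.0])
```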
#### HC Standard Errors

Validated against R's `sandwich::vcovHC()` and `lmtest::coeftest()` at tolerance 0.01 (relative).

```r
library(sandwich)
library(lmtest)

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.8, 8.2, 9.9)
m <- lm(y ~ x)

# HC0 through HC3
sqrt(diag(vcovHC(m, type = "HC0")))  # [0.12237, 0.03178]
sqrt(diag(vcovHC(m, type = "HC1")))  # [0.15796, 0.04103]
sqrt(diag(vcovHC(m, type = "HC2")))  # [0.15489, 0.04185]
sqrt(diag(vcovHC(m, type = "HC3")))  # [0.20343, 0.05722]
```
Exact R reference values (with-intercept model):
| HC Type | Intercept SE | Slope SE |
|---|---|---|
| HC0 | 0.12237 | 0.03178 |
| HC1 | 0.15796 | 0.04103 |
| HC2 | 0.15489 | 0.04185 |
| HC3 | 0.20343 | 0.05722 |
No-intercept model (validated separately):
| HC Type | Slope SE |
|---|---|
| HC0 | 0.02377 |
| HC1 | 0.02658 |
| HC2 | 0.02817 |
| HC3 | 0.03408 |
Properties verified:
- HC3 >= HC2 >= HC0 ordering for standard errors
- t-statistic = coefficient / SE identity (tolerance 0.001)
- p-values in [0, 1]
- Confidence intervals bracket the coefficient
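For simple regression the HC estimators reduce to weighted sums of squared residuals, with leverage hᵢ = 1/n + (xᵢ − x̄)²/Sxx. A pure-Python check of the slope SEs against the R values above:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.8, 8.2, 9.9]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx

e2 = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]  # squared residuals
h = [1 / n + (xi - mx) ** 2 / sxx for xi in x]             # leverages
c = [(xi - mx) / sxx for xi in x]                          # slope weights

def slope_se(omega):
    """Sandwich SE of the slope for a given variance estimate per observation."""
    return math.sqrt(sum(ci ** 2 * oi for ci, oi in zip(c, omega)))

hc0 = slope_se(e2)
hc1 = math.sqrt(n / (n - 2)) * hc0                          # df correction
hc2 = slope_se([ei / (1 - hi) for ei, hi in zip(e2, h)])    # leverage-adjusted
hc3 = slope_se([ei / (1 - hi) ** 2 for ei, hi in zip(e2, h)])

for se, ref in [(hc0, 0.03178), (hc1, 0.04103), (hc2, 0.04185), (hc3, 0.05722)]:
    assert abs(se - ref) < 1e-4  # matches the sandwich::vcovHC slope SEs
```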
### Ridge Regression

Validated against R's `glmnet` with alpha=0 at tolerance 0.01 (relative).

| Output field | R value | True value |
|---|---|---|
| `intercept` | -0.0145 | 0.0 |
| `coefficients[0]` | 1.5357 | 1.5 |
| `coefficients[1]` | -2.0012 | -2.0 |
| `coefficients[2]` | 0.4734 | 0.5 |
### Elastic Net

Validated against R's `glmnet` with alpha=0.5 at tolerance 0.2 (relative).

| Output field | R value | True value |
|---|---|---|
| `intercept` | -0.0339 | 0.0 |
| `coefficients[0]` | 1.4658 | 1.5 |
| `coefficients[1]` | -1.9413 | -2.0 |
| `coefficients[2]` | 0.4016 | 0.5 |
**Elastic Net tolerance:** Elastic Net uses coordinate descent, and implementations may vary slightly in convergence criteria. A relative tolerance of 0.2 accommodates these differences.
### Quantile Regression

Validated against R's `quantreg::rq()` (version 6.1).

Median regression (tau=0.5), n=50:

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 2.6995 | 1.0 |
| `coefficients[0]` | 1.2375 | 0.2 |
Multiple quantiles, n=100:
| Tau | R intercept | R slope | Slope tolerance |
|---|---|---|---|
| 0.25 | -0.3861 | 1.9995 | 0.2 |
| 0.50 | 0.8117 | 2.0102 | 0.2 |
| 0.75 | 1.5250 | 2.2169 | 0.3 |
**Quantile regression tolerance:** Quantile regression solutions are not unique when the number of observations on the solution hyperplane exceeds the number of parameters. Different simplex implementations may find different valid solutions, so the intercept tolerance is wider (1.0) while the slope tolerance is tighter (0.2).
### Prediction and Confidence Intervals

Validated against R's `predict.lm()` with `interval = "prediction"` and `interval = "confidence"`.
Large extrapolation test (n=20, predicting at x=100, 500, 1000):
| x | Fitted | PI lower | PI upper | Tolerance (fit) | Tolerance (bounds) |
|---|---|---|---|---|---|
| 100 | 205.578 | 200.856 | 210.301 | 0.1 | 0.2 |
| 500 | 1007.984 | 983.179 | 1032.790 | 0.1 | 0.5 |
| 1000 | 2010.992 | 1960.904 | 2061.080 | 0.1 | 1.0 |
Multiple confidence levels validated at x=35 (fit=80.159):
| Level | PI lower | PI upper |
|---|---|---|
| 80% | 78.916 | 81.401 |
| 90% | 78.548 | 81.769 |
| 95% | 78.219 | 82.098 |
| 99% | 77.542 | 82.775 |
## Generalized Linear Models

### Poisson GLM

Validated against R's `glm(family = poisson())`.

Poisson log link (n=30):

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.7992 | 0.01 |
| `coefficients[0]` | 0.7108 | 0.01 |
| `deviance` | 28.755 | 0.1 |
| `se_intercept` | 0.1813 | 0.01 |
| `se_coefficients[0]` | 0.0828 | 0.01 |

Poisson identity link (n=20):

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 4.0250 | 0.01 |
| `coefficients[0]` | 3.0167 | 0.01 |
| `deviance` | 2.063 | 0.1 |
### Binomial GLM

Validated against R's `glm(family = binomial())` across three link functions.

Logit link (n=30):

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.0000 | 0.1 |
| `coefficients[0]` | 0.8483 | 0.1 |
| `deviance` | 30.110 | 0.5 |
| `se_intercept` | 0.4502 | 0.05 |
| `se_coefficients[0]` | 0.3062 | 0.05 |

Probit link (n=30):

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.3428 | 0.1 |
| `coefficients[0]` | 0.9199 | 0.1 |
| `deviance` | 18.956 | 0.5 |
### Negative Binomial GLM

Validated against R's `MASS::glm.nb()`.

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 1.0157 | 0.05 |
| `coefficients[0]` | 0.0532 | 0.05 |
| `theta` | 5.5678 | sign/positivity check |
### Gamma GLM

Validated against R's `glm(family = Gamma(link = "log"))`.

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.6335 | 0.1 |
| `coefficients[0]` | 0.2899 | 0.1 |
| `deviance` | 19.153 | 1.0 |
| `se_intercept` | 0.3433 | 0.05 |
| `se_coefficients[0]` | 0.1805 | 0.05 |
### Inverse Gaussian GLM

Validated against R's `glm(family = inverse.gaussian(link = "log"))`.

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.5425 | 0.2 |
| `coefficients[0]` | 0.0365 | 0.2 |
| `deviance` | 19.642 | 2.0 |
### Poisson with Offset

Validated against R's `glm()` with an offset term for exposure adjustment.

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.2382 | 0.05 |
| `coefficients[0]` | 0.1003 | 0.01 |
| `deviance` | 21.571 | 0.5 |
### Tweedie GLMs

Validated against R's `statmod::tweedie()` family across four variance powers.
| Variance Power | Distribution | Intercept | Slope | Deviance |
|---|---|---|---|---|
| 1.0 | Poisson | 0.7992 | 0.7108 | 28.755 |
| 1.5 | Compound Poisson-Gamma | 0.1820 | 0.2193 | 47.805 |
| 2.0 | Gamma | 0.6335 | 0.2899 | 19.153 |
| 3.0 | Inverse Gaussian | 0.5425 | 0.0365 | 19.642 |
## TOST Equivalence Tests

Validated against R's TOSTER package (`TOSTone()`, `TOSTtwo()`, `TOSTpaired()`, `TOSTr()`, `TOSTtwo.prop()`, `wilcox_TOST()`, `boot_t_TOST()`).
Tests verify both equivalence detection and non-equivalence detection:
| Scenario | Bound | Expected result |
|---|---|---|
| One-sample, mean near 0 | ±0.5 | equivalent = true |
| One-sample, mean near 2 | ±0.5 | equivalent = false |
| Two-sample, similar means | ±0.5 | equivalent = true |
| Two-sample, different means | ±0.5 | equivalent = false |
| Proportion, 52/100 vs 0.5 | ±0.1 | estimate = 0.02 (exact) |
| Cohen's d bounds | d = 0.5 | bounds computed correctly |
| Bootstrap TOST | ±0.5 | reproducible with same seed |
## R Packages Used

The full list of R packages referenced in validation:

| R Package | What it validates |
|---|---|
| `stats` (base R) | `t.test()`, `aov()`, `oneway.test()`, `lm()`, `glm()`, `predict()`, `shapiro.test()`, `chisq.test()`, `fisher.test()`, `cor.test()`, `wilcox.test()`, `kruskal.test()` |
| `sandwich` | HC0-HC3 heteroskedasticity-consistent standard errors |
| `lmtest` | `coeftest()` for robust inference |
| `glmnet` | Ridge (alpha=0) and Elastic Net regression |
| `quantreg` | Quantile regression (`rq()`) |
| `MASS` | Negative Binomial GLM (`glm.nb()`) |
| `statmod` | Tweedie GLM family |
| `car` | Brown-Forsythe / Levene's test |
| `WRS2` | Yuen's trimmed means test |
| `brunnermunzel` | Brunner-Munzel test |
| `TOSTER` | TOST equivalence tests |
| `rcompanion` | Cramer's V |
| `vcd` | Contingency coefficient |
| `psych` | Phi coefficient, Cohen's kappa |
| `irr` | Cohen's kappa |
| `moments` | D'Agostino K-squared test |
| `coin` | Permutation tests |
## Summary
| Category | Tests validated | Typical tolerance |
|---|---|---|
| Parametric tests | t-tests (Welch, Student, paired), Yuen, Brown-Forsythe, ANOVA (one-way, two-way, repeated measures, Welch) | 1e-10 |
| Non-parametric tests | Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, Brunner-Munzel | 1e-6 to 1e-10 |
| Correlation | Pearson, Spearman, Kendall | 1e-4 to 1e-10 |
| Categorical tests | Chi-square, Fisher exact, Cramer's V, Phi, Contingency coefficient, Cohen's kappa, G-test | 1e-4 to 1e-10 |
| Distributional tests | Shapiro-Wilk, D'Agostino K-squared | 1e-3 |
| Linear regression | OLS, WLS, Ridge, Elastic Net, Quantile | 0.01 to 0.2 |
| HC standard errors | HC0, HC1, HC2, HC3 | 0.01 |
| Prediction intervals | Prediction and confidence intervals at multiple levels | 0.01 to 1.0 |
| GLMs | Poisson (log/identity/sqrt), Binomial (logit/probit/cloglog), Gamma, Inverse Gaussian, Negative Binomial, Tweedie (4 powers) | 0.01 to 0.5 |
| Equivalence tests | TOST one-sample, two-sample, paired, proportion, correlation, Wilcoxon, bootstrap | 1e-6 |