# R Validation
polars-statistics is validated against R's reference implementations at the Rust engine level. Every statistical test, regression model, and GLM is tested against exact values produced by R scripts using set.seed() for reproducibility.
## Validation Architecture

```text
polars-statistics (Python/Polars)
├── anofox-regression (Rust crate)   ← regression/GLM validation
└── anofox-statistics (Rust crate)   ← statistical tests validation
    └── R scripts with set.seed(42)
        └── CSV reference data generated by R
            └── Rust tests assert against R values
```
The R validation lives in two upstream Rust crates:
- anofox-regression — OLS, WLS, Ridge, Elastic Net, Quantile, GLMs (Poisson, Binomial, Gamma, Inverse Gaussian, Negative Binomial, Tweedie)
- anofox-statistics — t-tests, ANOVA, Mann-Whitney, Wilcoxon, Kruskal-Wallis, Brunner-Munzel, Pearson/Spearman/Kendall, chi-square, Fisher, Shapiro-Wilk, and more
Reference values are generated by R scripts (`Rscript R/generate_refs.R`, `generate_regression_validation.R`, `generate_glm_validation.R`) and stored as CSV files or Rust constants. Rust tests then assert that computed values match R within strict tolerances.
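The comparison pattern those Rust tests follow can be sketched in Python (illustrative only; the helper name `assert_close` and its signature are hypothetical, not part of either crate):

```python
def assert_close(actual: float, expected: float, tol: float, relative: bool = False) -> None:
    """Check a computed value against an R reference within a tolerance.

    Absolute tolerances (e.g. 1e-10) are typical for test statistics;
    relative tolerances (e.g. 0.01) for regression coefficients.
    """
    diff = abs(actual - expected)
    if relative:
        diff /= max(abs(expected), 1e-300)  # guard against division by zero
    assert diff < tol, f"{actual} vs {expected} (tol={tol})"

# A statistic matching R to 1e-10 absolute
assert_close(2.1213203435596424, 2.1213203435596424, 1e-10)
# A coefficient matching R to 1% relative error
assert_close(1.787, 1.7870, 0.01, relative=True)
```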
## Statistical Tests

### Parametric Tests

#### T-Tests

Validated against R's `t.test()` at tolerance 1e-10.

```python
import polars as pl
import polars_statistics as ps

df = pl.DataFrame({"g1": [...], "g2": [...]})

# Welch t-test (default)
df.select(ps.ttest_ind("g1", "g2").alias("t"))

# Paired t-test
df.select(ps.ttest_paired("g1", "g2").alias("t"))

# One-sided
df.select(ps.ttest_ind("g1", "g2", alternative="less").alias("t"))

# With location shift
df.select(ps.ttest_ind("g1", "g2", mu=0.5).alias("t"))
```
What is validated:

| Output field | R field | Tolerance |
|---|---|---|
| `statistic` | t statistic | 1e-10 |
| `df` | degrees of freedom | 1e-10 |
| `p_value` | p-value (two-sided, less, greater) | 1e-10 |
| `mean_x`, `mean_y` | sample means | 1e-10 |
| `ci_lower`, `ci_upper` | confidence intervals (90%, 95%, 99%) | 1e-6 |
#### Yuen's Trimmed Means Test

Validated against R's `WRS2::yuend()` at tolerance 1e-10.

| Output field | Tolerance |
|---|---|
| `statistic` | 1e-10 |
| `df` | 1e-10 |
| `p_value` (two-sided, less, greater) | 1e-10 |
| `estimate` (trimmed mean difference) | 1e-10 |
| `ci_lower`, `ci_upper` (95% CI) | 1e-6 |
#### Brown-Forsythe Test

Validated against R's `car::leveneTest(center = "median")` at tolerance 1e-10.

| Output field | Tolerance |
|---|---|
| `statistic` (F) | 1e-10 |
| `df1`, `df2` | 1e-10 |
| `p_value` | 1e-10 |
#### ANOVA

Validated against R's `aov()`, `oneway.test()`, and `mauchly.test()`.

```r
# One-way (Fisher, assumes equal variances)
oneway.test(y ~ group, data = df, var.equal = TRUE)

# One-way (Welch, no equal-variance assumption)
oneway.test(y ~ group, data = df, var.equal = FALSE)

# Two-way with interaction
aov(y ~ A * B, data = df)

# Repeated measures
aov(value ~ condition + Error(subject/condition), data = df)
mauchly.test(fit)  # sphericity
```
One-way ANOVA (Fisher) validated fields:

| Output field | R field | Tolerance |
|---|---|---|
| `statistic` | F | 1e-10 |
| `df_between`, `df_within` | numerator/denominator df | 1e-10 |
| `p_value` | p-value | 1e-10 |
| `ss_between`, `ss_within` | sum of squares | 1e-10 |
| `ms_between`, `ms_within` | mean squares | 1e-10 |
| `grand_mean` | overall mean | 1e-10 |
Algebraic identity verified: with 2 groups, F = t² (the one-way ANOVA F-statistic equals the squared pooled-variance Student t-statistic).
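The identity is easy to confirm by hand. A minimal pure-Python check on made-up data (pooled-variance t vs. one-way F):

```python
g1 = [1.0, 2.0, 3.0, 4.0, 5.0]
g2 = [3.0, 4.0, 5.0, 6.0, 7.0]
n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2

# Pooled-variance (Student) t-statistic
ss1 = sum((v - m1) ** 2 for v in g1)
ss2 = sum((v - m2) ** 2 for v in g2)
sp2 = (ss1 + ss2) / (n1 + n2 - 2)          # pooled variance
t = (m1 - m2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA F-statistic for the same two groups
grand = (sum(g1) + sum(g2)) / (n1 + n2)
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
ss_within = ss1 + ss2
f = (ss_between / 1) / (ss_within / (n1 + n2 - 2))

assert abs(f - t ** 2) < 1e-12             # F = t^2 with two groups
```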
Two-way ANOVA validates SS, df, MS, F, and p-value for Factor A, Factor B, and Interaction (Type I for balanced, Type III for unbalanced designs).
Repeated measures ANOVA additionally validates Mauchly's W (sphericity), Greenhouse-Geisser epsilon, and Huynh-Feldt epsilon.
### Non-Parametric Tests

#### Mann-Whitney U

Validated against R's `wilcox.test()` at tolerance 1e-10 (statistic) and 1e-6 (p-value).

| Output field | Tolerance |
|---|---|
| `statistic` (W) | 1e-10 |
| `p_value` (two-sided, less, greater) | 1e-6 |
| `p_value` with continuity correction | 1e-6 |
| `estimate` (Hodges-Lehmann, exact mode) | 1e-10 |
| `ci_lower`, `ci_upper` (exact, 90%/95%) | 1e-10 |
#### Wilcoxon Signed-Rank

Validated against R's `wilcox.test(paired = TRUE)` at tolerances 1e-10 (statistic) and 1e-6 (p-value).

#### Kruskal-Wallis

Validated against R's `kruskal.test()` at tolerances 1e-10 (statistic) and 1e-6 (p-value).

#### Brunner-Munzel

Validated against R's `brunnermunzel::brunnermunzel.test()` at tolerance 1e-6.
| Output field | Tolerance |
|---|---|
| `statistic` | 1e-6 |
| `df` | 1e-6 |
| `p_value` (two-sided, less, greater) | 1e-6 |
| `estimate` (P(X < Y) win probability) | 1e-6 |
| `ci_lower`, `ci_upper` (90%, 95%, 99%) | 1e-6 |
### Correlation

#### Pearson

Validated against R's `cor.test(method = "pearson")` at tolerance 1e-10.

| Output field | Tolerance |
|---|---|
| `estimate` (r) | 1e-10 |
| `statistic` (t) | 1e-10 |
| `df` | 1e-10 |
| `p_value` | 1e-6 |
| `ci_lower`, `ci_upper` (90%, 95%, 99%) | 1e-4 |
Special cases validated: perfect positive (r=1), perfect negative (r=-1), zero correlation.
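The r → t mapping that `cor.test()` uses for its p-value, t = r·√((n−2)/(1−r²)), can be reproduced by hand on illustrative data:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)            # Pearson correlation
t = r * math.sqrt((n - 2) / (1 - r * r))  # cor.test's t statistic, df = n - 2

print(round(r, 4), round(t, 4))  # 0.7746 2.1213
```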
#### Spearman

Validated against R's `cor.test(method = "spearman")` at tolerance 1e-10 (rho).

Special cases validated: perfect, negative, zero correlation, tied ranks.

#### Kendall

Validated against R's `cor.test(method = "kendall")` at tolerance 1e-4 (tau).
### Categorical Tests

#### Chi-Square Test

Validated against R's `chisq.test()` at tolerance 1e-10.

```r
# Test of independence (2x2)
chisq.test(matrix(c(10, 20, 30, 40), nrow = 2), correct = FALSE)

# With Yates' correction
chisq.test(matrix(c(10, 20, 30, 40), nrow = 2), correct = TRUE)

# Goodness of fit (uniform)
chisq.test(c(20, 25, 22, 18, 15))

# Goodness of fit (custom proportions)
chisq.test(c(50, 30, 20), p = c(0.5, 0.3, 0.2))
```
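For the 2x2 independence example (R's `matrix()` fills column-wise), the uncorrected statistic can be checked by hand:

```python
# matrix(c(10, 20, 30, 40), nrow = 2) in R produces:
table = [[10, 30],
         [20, 40]]

n = sum(sum(row) for row in table)
row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]

# X^2 = sum over cells of (observed - expected)^2 / expected
chi2 = sum(
    (table[i][j] - e) ** 2 / e
    for i in range(2)
    for j in range(2)
    for e in [row_sums[i] * col_sums[j] / n]  # expected count under independence
)
print(round(chi2, 4))  # 0.7937
```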
#### Fisher's Exact Test

Validated against R's `fisher.test()` at tolerance 0.01 (p-value).

#### Effect Size Measures

| Measure | R Package | Tolerance |
|---|---|---|
| Cramer's V | `rcompanion::cramerV()` | 1e-4 |
| Phi coefficient | `psych::phi()` | 1e-4 |
| Contingency coefficient | `vcd::assocstats()` | 1e-4 |
| Cohen's kappa | `irr::kappa2()` | 0.01 |
Cohen's kappa example (manually verified):

```text
Confusion matrix: [[20, 5, 0], [10, 30, 5], [0, 5, 25]]
Po = (20 + 30 + 25) / 100 = 0.75
Pe = (25×30 + 45×40 + 30×30) / 10000 = 0.345
kappa = (0.75 - 0.345) / (1 - 0.345) = 0.618
```
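The same arithmetic in Python, using the confusion matrix from the example above:

```python
m = [[20, 5, 0],
     [10, 30, 5],
     [0, 5, 25]]

n = sum(sum(row) for row in m)
po = sum(m[i][i] for i in range(3)) / n                       # observed agreement
row_sums = [sum(row) for row in m]
col_sums = [sum(col) for col in zip(*m)]
pe = sum(row_sums[i] * col_sums[i] for i in range(3)) / n**2  # chance agreement

kappa = (po - pe) / (1 - pe)
print(round(po, 3), round(pe, 3), round(kappa, 3))  # 0.75 0.345 0.618
```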
### Distributional Tests

#### Shapiro-Wilk

Validated against R's `shapiro.test()`.
| Scenario | W tolerance | p-value tolerance |
|---|---|---|
| Normal data (small/medium/large n) | 1e-3 | 0.05 |
| Non-normal (uniform, exponential) | 1e-3 | 0.05 |
| Edge cases (n=3) | 1e-3 | 0.05 |
| Edge cases (n=4, n=5) | 0.02 | directional only |
| Constant data | 1e-10 (W=1.0) | 1e-10 (p=1.0) |
**Tolerance rationale:** Shapiro-Wilk has known implementation variations across software packages. The W statistic typically matches to 3-4 decimal places between R and other implementations. polars-statistics uses the same algorithm but may differ slightly in coefficient tables.
#### D'Agostino K-Squared

Validated against R's `moments::agostino.test()`.
| Scenario | z-score tolerance | p-value |
|---|---|---|
| Normal data | 0.15 | > 0.05 (not significant) |
| Skewed data | 0.15 | < 0.05 (significant) |
## Regression

### OLS and WLS

#### Weighted Least Squares

Validated against R's `lm()` with weights at tolerance 0.01 (relative).

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 3.1055 | 0.01 |
| `coefficients[0]` | 1.7870 | 0.01 |
| `se_intercept` | 0.6773 | 0.01 |
| `se_coefficients[0]` | 0.1913 | 0.01 |
| `r_squared` | 0.8289 | 0.01 |
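For a single predictor, the WLS estimator has a closed form. A sketch on hypothetical data (the dataset behind the R values above is not reproduced here); with unit weights it collapses to ordinary least squares, a cheap sanity check:

```python
def wls_fit(x, y, w):
    """Closed-form weighted least squares for y = a + b*x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b  # (intercept, slope)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.8, 8.2, 9.9]

# Unit weights reproduce lm(y ~ x)
a, b = wls_fit(x, y, [1.0] * 5)
print(round(a, 2), round(b, 2))  # 0.21 1.95

# Up-weighting one observation shifts the fit, mirroring lm(y ~ x, weights = w)
a_w, b_w = wls_fit(x, y, [4.0, 1.0, 1.0, 1.0, 1.0])
```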
#### HC Standard Errors

Validated against R's `sandwich::vcovHC()` and `lmtest::coeftest()` at tolerance 0.01 (relative).

```r
library(sandwich)
library(lmtest)

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.8, 8.2, 9.9)
m <- lm(y ~ x)

# HC0 through HC3
sqrt(diag(vcovHC(m, type = "HC0")))  # [0.12237, 0.03178]
sqrt(diag(vcovHC(m, type = "HC1")))  # [0.15796, 0.04103]
sqrt(diag(vcovHC(m, type = "HC2")))  # [0.15489, 0.04185]
sqrt(diag(vcovHC(m, type = "HC3")))  # [0.20343, 0.05722]
```
Exact R reference values (with-intercept model):
| HC Type | Intercept SE | Slope SE |
|---|---|---|
| HC0 | 0.12237 | 0.03178 |
| HC1 | 0.15796 | 0.04103 |
| HC2 | 0.15489 | 0.04185 |
| HC3 | 0.20343 | 0.05722 |
No-intercept model (validated separately):
| HC Type | Slope SE |
|---|---|
| HC0 | 0.02377 |
| HC1 | 0.02658 |
| HC2 | 0.02817 |
| HC3 | 0.03408 |
Properties verified:
- HC3 >= HC2 >= HC0 ordering for standard errors
- t-statistic = coefficient / SE identity (tolerance 0.001)
- p-values in [0, 1]
- Confidence intervals bracket the coefficient
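For simple regression the HC estimators reduce to weighted sums of squared residuals, with leverage hᵢ = 1/n + (xᵢ − x̄)²/Sxx. A pure-Python check of the slope SEs against the R values above:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.8, 8.2, 9.9]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx

e2 = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]  # squared residuals
h = [1 / n + (xi - mx) ** 2 / sxx for xi in x]             # leverages
c = [(xi - mx) / sxx for xi in x]                          # slope weights

def slope_se(omega):
    """Sandwich SE of the slope for a given variance estimate per observation."""
    return math.sqrt(sum(ci ** 2 * oi for ci, oi in zip(c, omega)))

hc0 = slope_se(e2)
hc1 = math.sqrt(n / (n - 2)) * hc0                          # df correction
hc2 = slope_se([ei / (1 - hi) for ei, hi in zip(e2, h)])    # leverage-adjusted
hc3 = slope_se([ei / (1 - hi) ** 2 for ei, hi in zip(e2, h)])

for se, ref in [(hc0, 0.03178), (hc1, 0.04103), (hc2, 0.04185), (hc3, 0.05722)]:
    assert abs(se - ref) < 1e-4  # matches the sandwich::vcovHC slope SEs
```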
### Ridge Regression

Validated against R's `glmnet` with alpha=0 at tolerance 0.01 (relative).

| Output field | R value | True value |
|---|---|---|
| `intercept` | -0.0145 | 0.0 |
| `coefficients[0]` | 1.5357 | 1.5 |
| `coefficients[1]` | -2.0012 | -2.0 |
| `coefficients[2]` | 0.4734 | 0.5 |
### Elastic Net

Validated against R's `glmnet` with alpha=0.5 at tolerance 0.2 (relative).

| Output field | R value | True value |
|---|---|---|
| `intercept` | -0.0339 | 0.0 |
| `coefficients[0]` | 1.4658 | 1.5 |
| `coefficients[1]` | -1.9413 | -2.0 |
| `coefficients[2]` | 0.4016 | 0.5 |
**Elastic Net tolerance:** Elastic Net uses coordinate descent, and implementations may vary slightly in convergence criteria. A relative tolerance of 0.2 accommodates these differences.
### Quantile Regression

Validated against R's `quantreg::rq()` (version 6.1).

Median regression (tau=0.5), n=50:

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 2.6995 | 1.0 |
| `coefficients[0]` | 1.2375 | 0.2 |
Multiple quantiles, n=100:
| Tau | R intercept | R slope | Slope tolerance |
|---|---|---|---|
| 0.25 | -0.3861 | 1.9995 | 0.2 |
| 0.50 | 0.8117 | 2.0102 | 0.2 |
| 0.75 | 1.5250 | 2.2169 | 0.3 |
**Quantile regression tolerance:** Quantile regression solutions are not unique when the number of observations on the solution hyperplane exceeds the number of parameters. Different simplex implementations may find different valid solutions, so the intercept tolerance is wider (1.0) while the slope tolerance is tighter (0.2).
### Prediction and Confidence Intervals

Validated against R's `predict.lm()` with `interval = "prediction"` and `interval = "confidence"`.
Large extrapolation test (n=20, predicting at x=100, 500, 1000):
| x | Fitted | PI lower | PI upper | Tolerance (fit) | Tolerance (bounds) |
|---|---|---|---|---|---|
| 100 | 205.578 | 200.856 | 210.301 | 0.1 | 0.2 |
| 500 | 1007.984 | 983.179 | 1032.790 | 0.1 | 0.5 |
| 1000 | 2010.992 | 1960.904 | 2061.080 | 0.1 | 1.0 |
Multiple confidence levels validated at x=35 (fit=80.159):
| Level | PI lower | PI upper |
|---|---|---|
| 80% | 78.916 | 81.401 |
| 90% | 78.548 | 81.769 |
| 95% | 78.219 | 82.098 |
| 99% | 77.542 | 82.775 |
## Generalized Linear Models

### Poisson GLM

Validated against R's `glm(family = poisson())`.

Poisson log link (n=30):

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.7992 | 0.01 |
| `coefficients[0]` | 0.7108 | 0.01 |
| `deviance` | 28.755 | 0.1 |
| `se_intercept` | 0.1813 | 0.01 |
| `se_coefficients[0]` | 0.0828 | 0.01 |

Poisson identity link (n=20):

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 4.0250 | 0.01 |
| `coefficients[0]` | 3.0167 | 0.01 |
| `deviance` | 2.063 | 0.1 |
### Binomial GLM

Validated against R's `glm(family = binomial())` across three link functions.

Logit link (n=30):

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.0000 | 0.1 |
| `coefficients[0]` | 0.8483 | 0.1 |
| `deviance` | 30.110 | 0.5 |
| `se_intercept` | 0.4502 | 0.05 |
| `se_coefficients[0]` | 0.3062 | 0.05 |

Probit link (n=30):

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.3428 | 0.1 |
| `coefficients[0]` | 0.9199 | 0.1 |
| `deviance` | 18.956 | 0.5 |
### Negative Binomial GLM

Validated against R's `MASS::glm.nb()`.

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 1.0157 | 0.05 |
| `coefficients[0]` | 0.0532 | 0.05 |
| `theta` | 5.5678 | sign/positivity check |
### Gamma GLM

Validated against R's `glm(family = Gamma(link = "log"))`.

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.6335 | 0.1 |
| `coefficients[0]` | 0.2899 | 0.1 |
| `deviance` | 19.153 | 1.0 |
| `se_intercept` | 0.3433 | 0.05 |
| `se_coefficients[0]` | 0.1805 | 0.05 |
### Inverse Gaussian GLM

Validated against R's `glm(family = inverse.gaussian(link = "log"))`.

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.5425 | 0.2 |
| `coefficients[0]` | 0.0365 | 0.2 |
| `deviance` | 19.642 | 2.0 |
### Poisson with Offset

Validated against R's `glm()` with an offset term for exposure adjustment.

| Output field | R value | Tolerance |
|---|---|---|
| `intercept` | 0.2382 | 0.05 |
| `coefficients[0]` | 0.1003 | 0.01 |
| `deviance` | 21.571 | 0.5 |
### Tweedie GLMs

Validated against R's `statmod::tweedie()` family across four variance powers.
| Variance Power | Distribution | Intercept | Slope | Deviance |
|---|---|---|---|---|
| 1.0 | Poisson | 0.7992 | 0.7108 | 28.755 |
| 1.5 | Compound Poisson-Gamma | 0.1820 | 0.2193 | 47.805 |
| 2.0 | Gamma | 0.6335 | 0.2899 | 19.153 |
| 3.0 | Inverse Gaussian | 0.5425 | 0.0365 | 19.642 |
## TOST Equivalence Tests

Validated against R's TOSTER package (`TOSTone()`, `TOSTtwo()`, `TOSTpaired()`, `TOSTr()`, `TOSTtwo.prop()`, `wilcox_TOST()`, `boot_t_TOST()`).
Tests verify both equivalence detection and non-equivalence detection:
| Scenario | Bound | Expected result |
|---|---|---|
| One-sample, mean near 0 | ±0.5 | equivalent = true |
| One-sample, mean near 2 | ±0.5 | equivalent = false |
| Two-sample, similar means | ±0.5 | equivalent = true |
| Two-sample, different means | ±0.5 | equivalent = false |
| Proportion, 52/100 vs 0.5 | ±0.1 | estimate = 0.02 (exact) |
| Cohen's d bounds | d = 0.5 | bounds computed correctly |
| Bootstrap TOST | ±0.5 | reproducible with same seed |
## R Packages Used

The full list of R packages referenced in validation:

| R Package | What it validates |
|---|---|
| `stats` (base R) | `t.test()`, `aov()`, `oneway.test()`, `lm()`, `glm()`, `predict()`, `shapiro.test()`, `chisq.test()`, `fisher.test()`, `cor.test()`, `wilcox.test()`, `kruskal.test()` |
| `sandwich` | HC0-HC3 heteroskedasticity-consistent standard errors |
| `lmtest` | `coeftest()` for robust inference |
| `glmnet` | Ridge (alpha=0) and Elastic Net regression |
| `quantreg` | Quantile regression (`rq()`) |
| `MASS` | Negative Binomial GLM (`glm.nb()`) |
| `statmod` | Tweedie GLM family |
| `car` | Brown-Forsythe / Levene's test |
| `WRS2` | Yuen's trimmed means test |
| `brunnermunzel` | Brunner-Munzel test |
| `TOSTER` | TOST equivalence tests |
| `rcompanion` | Cramer's V |
| `vcd` | Contingency coefficient |
| `psych` | Phi coefficient, Cohen's kappa |
| `irr` | Cohen's kappa |
| `moments` | D'Agostino K-squared test |
| `coin` | Permutation tests |
## Summary
| Category | Tests validated | Typical tolerance |
|---|---|---|
| Parametric tests | t-tests (Welch, Student, paired), Yuen, Brown-Forsythe, ANOVA (one-way, two-way, repeated measures, Welch) | 1e-10 |
| Non-parametric tests | Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, Brunner-Munzel | 1e-6 to 1e-10 |
| Correlation | Pearson, Spearman, Kendall | 1e-4 to 1e-10 |
| Categorical tests | Chi-square, Fisher exact, Cramer's V, Phi, Contingency coefficient, Cohen's kappa, G-test | 1e-4 to 1e-10 |
| Distributional tests | Shapiro-Wilk, D'Agostino K-squared | 1e-3 |
| Linear regression | OLS, WLS, Ridge, Elastic Net, Quantile | 0.01 to 0.2 |
| HC standard errors | HC0, HC1, HC2, HC3 | 0.01 |
| Prediction intervals | Prediction and confidence intervals at multiple levels | 0.01 to 1.0 |
| GLMs | Poisson (log/identity/sqrt), Binomial (logit/probit/cloglog), Gamma, Inverse Gaussian, Negative Binomial, Tweedie (4 powers) | 0.01 to 0.5 |
| Equivalence tests | TOST one-sample, two-sample, paired, proportion, correlation, Wilcoxon, bootstrap | 1e-6 |