# TOST Equivalence Tests
Two One-Sided Tests (TOST) for testing practical equivalence. Unlike traditional hypothesis tests that test for difference, TOST tests whether effects are small enough to be considered equivalent.
Validation: All TOST tests are validated against the implementations in R's TOSTER and equivalence packages.
## Understanding TOST
The problem with traditional tests: A non-significant result (p > 0.05) does NOT prove equivalence—it only means we failed to detect a difference. Absence of evidence is not evidence of absence.
TOST solution: Define an equivalence region [-δ, +δ] representing "practically equivalent." If we can reject both:
- H₀₁: effect ≤ -δ (effect is too negative)
- H₀₂: effect ≥ +δ (effect is too positive)

then we conclude the effect lies within [-δ, +δ], establishing equivalence.
Key insight: TOST reverses the burden of proof. The null hypothesis is non-equivalence, so the data must supply evidence before equivalence can be claimed.
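The two one-sided tests can be sketched with a normal approximation. This is an illustration of the logic only, not the library's implementation: `tost_z` is a hypothetical helper, and it assumes the estimate and its standard error are already known.

```python
from statistics import NormalDist


def tost_z(estimate, se, lower, upper, alpha=0.05):
    """Hypothetical helper: TOST using a normal (z) approximation."""
    norm = NormalDist()
    # H01: effect <= lower. Small p when the estimate sits well above the lower bound.
    p_lower = 1.0 - norm.cdf((estimate - lower) / se)
    # H02: effect >= upper. Small p when the estimate sits well below the upper bound.
    p_upper = norm.cdf((estimate - upper) / se)
    # Both nulls must be rejected, so the reported p-value is the larger of the two.
    p_tost = max(p_lower, p_upper)
    return {
        "p_lower": p_lower,
        "p_upper": p_upper,
        "tost_p_value": p_tost,
        "equivalent": p_tost < alpha,
    }


# Same estimate, different precision: only the precise one establishes equivalence.
precise = tost_z(estimate=0.1, se=0.15, lower=-0.5, upper=0.5)
noisy = tost_z(estimate=0.1, se=0.40, lower=-0.5, upper=0.5)
print(precise["equivalent"], noisy["equivalent"])
```

The second call shows why a small observed effect is not enough on its own: with a large standard error, neither one-sided null can be rejected.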
## Common Parameters
| Parameter | Description |
|---|---|
| bounds_type | "symmetric" (±delta), "raw" (lower/upper), or "cohen_d" |
| delta | Equivalence margin for symmetric/cohen_d bounds |
| lower, upper | Explicit bounds for raw bounds_type |
| alpha | Significance level (default 0.05) |
Choosing bounds:
- "symmetric": Use when the effect should be within ±delta of zero (or reference)
- "raw": Use when bounds are asymmetric (e.g., -0.3 to +0.5)
- "cohen_d": Use when delta represents a standardized effect size (Cohen's d)
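One way to read the three options is as a mapping onto raw limits. The sketch below is an assumption about that mapping, not the library's code: `resolve_bounds` is a hypothetical helper, and scaling a Cohen's d margin by a standard deviation `sd` is the conventional conversion rather than something this page states.

```python
def resolve_bounds(bounds_type, delta=0.5, lower=-0.5, upper=0.5, sd=None):
    """Hypothetical helper: turn the bounds parameters into raw (lower, upper) limits."""
    if bounds_type == "symmetric":
        # delta is already in the units of the data
        return (-delta, delta)
    if bounds_type == "raw":
        # explicit, possibly asymmetric limits
        return (lower, upper)
    if bounds_type == "cohen_d":
        # Assumption: a standardized margin is converted to raw units via an SD
        if sd is None:
            raise ValueError("cohen_d bounds need a standard deviation")
        return (-delta * sd, delta * sd)
    raise ValueError(f"unknown bounds_type: {bounds_type!r}")


print(resolve_bounds("symmetric", delta=0.25))
print(resolve_bounds("raw", lower=-0.3, upper=0.5))
print(resolve_bounds("cohen_d", delta=0.3, sd=2.0))
```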
## Return Structure
All TOST tests return:
Struct{
estimate: Float64, # Point estimate
ci_lower: Float64, # CI lower bound
ci_upper: Float64, # CI upper bound
bound_lower: Float64, # Equivalence lower bound
bound_upper: Float64, # Equivalence upper bound
tost_p_value: Float64, # TOST p-value (max of two one-sided)
equivalent: Boolean, # True if equivalence established
alpha: Float64,
n: UInt32,
}
Decision rule: If tost_p_value < alpha, equivalence is established (equivalent = True).
## T-Test Based TOST
### `tost_t_test_one_sample`
One-sample TOST equivalence test for comparing a mean to a reference value.
Tests whether the population mean is equivalent to a reference value μ (within ±delta).
ps.tost_t_test_one_sample(
x: Union[pl.Expr, str],
mu: float = 0.0,
bounds_type: str = "symmetric",
delta: float = 0.5,
lower: float = -0.5,
upper: float = 0.5,
alpha: float = 0.05,
) -> pl.Expr
Null hypothesis: The mean differs from μ by at least delta (|mean - μ| ≥ delta).
When to use:
- Testing if a batch mean is equivalent to a target value
- Validating that a process is operating within acceptable limits
- Quality control applications
Example:
# Test if mean is within ±0.5 of target (10.0)
df.select(ps.tost_t_test_one_sample("measurements", mu=10.0, delta=0.5))
### `tost_t_test_two_sample`
Two-sample TOST equivalence test for comparing two independent groups.
Tests whether the difference between two group means is practically zero (within ±delta).
ps.tost_t_test_two_sample(
x: Union[pl.Expr, str],
y: Union[pl.Expr, str],
bounds_type: str = "symmetric",
delta: float = 0.5,
lower: float = -0.5,
upper: float = 0.5,
alpha: float = 0.05,
pooled: bool = False,
) -> pl.Expr
Null hypothesis: The difference between means exceeds the equivalence bounds.
Parameters:
- pooled=False (default): Welch's approach, doesn't assume equal variances
- pooled=True: Student's approach, assumes equal variances
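The two options differ in the standard error and degrees of freedom used for the t distribution. A minimal sketch of the textbook formulas follows; `se_and_df` is a hypothetical helper and the data are made up for illustration.

```python
from math import sqrt
from statistics import variance


def se_and_df(x, y, pooled=False):
    """Hypothetical helper: standard error and degrees of freedom for mean(x) - mean(y)."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    if pooled:
        # Student: one pooled variance, df = n1 + n2 - 2
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        return sqrt(sp2 * (1 / n1 + 1 / n2)), float(n1 + n2 - 2)
    # Welch: separate variances, Welch-Satterthwaite degrees of freedom
    a, b = v1 / n1, v2 / n2
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return sqrt(a + b), df


treatment = [10.1, 9.8, 10.4, 10.0, 9.7, 10.3]
control = [10.0, 11.5, 8.6, 10.9, 9.1, 12.0, 8.2, 10.5]

se_w, df_w = se_and_df(treatment, control)               # Welch
se_p, df_p = se_and_df(treatment, control, pooled=True)  # Student
print(f"Welch:  se={se_w:.3f}, df={df_w:.1f}")
print(f"Pooled: se={se_p:.3f}, df={df_p:.1f}")
```

When the group variances are clearly unequal, as here, the Welch degrees of freedom fall below n1 + n2 - 2.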
When to use:
- Demonstrating bioequivalence (generic vs brand drug)
- Showing two methods produce equivalent results
- Validating that a change didn't meaningfully affect outcomes
Example:
# Test if treatment and control differ by less than 0.5 units
df.select(ps.tost_t_test_two_sample("treatment", "control", delta=0.5))
# Using Cohen's d for standardized equivalence bounds
df.select(ps.tost_t_test_two_sample("treatment", "control",
bounds_type="cohen_d", delta=0.3))
### `tost_t_test_paired`
Paired-samples TOST equivalence test for repeated measures.
Tests whether the mean difference between paired observations is practically zero.
ps.tost_t_test_paired(
x: Union[pl.Expr, str],
y: Union[pl.Expr, str],
bounds_type: str = "symmetric",
delta: float = 0.5,
lower: float = -0.5,
upper: float = 0.5,
alpha: float = 0.05,
) -> pl.Expr
Null hypothesis: The mean paired difference exceeds the equivalence bounds.
When to use:
- Before/after studies showing no meaningful change
- Method comparison on the same samples
- Test-retest reliability assessment
Example:
# Test if method A and B give equivalent measurements
df.select(ps.tost_t_test_paired("method_a", "method_b", delta=0.25))
---
## Correlation TOST
### `tost_correlation`
Correlation TOST equivalence test using Fisher's z-transformation.
Tests whether the correlation between two variables is practically equivalent to a reference value (typically zero). Useful for demonstrating negligible association.
```python
ps.tost_correlation(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    method: str = "pearson",  # "pearson" or "spearman"
    rho_null: float = 0.0,
    bounds_type: str = "symmetric",
    delta: float = 0.3,
    lower: float = -0.3,
    upper: float = 0.3,
    alpha: float = 0.05,
) -> pl.Expr
```
Null hypothesis: The correlation differs from rho_null by more than delta.
Parameters:
- method: "pearson" for linear correlation, "spearman" for monotonic
- rho_null: Reference correlation to test against (default 0)
- delta=0.3: Common choice based on |r| < 0.3 being "weak" correlation
When to use:
- Demonstrating independence between variables
- Showing confounders have negligible association
- Validating discriminant validity (measures should not correlate)
Example:
# Test if correlation is practically zero (|r| < 0.3)
df.select(ps.tost_correlation("x", "y", delta=0.3))
# Test if correlation is equivalent to a target (0.8)
df.select(ps.tost_correlation("scale_a", "scale_b",
rho_null=0.8, delta=0.1))
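The Fisher z approach can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `tost_fisher_z` is a hypothetical helper, it takes a precomputed Pearson correlation `r` and sample size `n`, and it uses the usual large-sample standard error 1/sqrt(n - 3).

```python
from math import atanh, sqrt
from statistics import NormalDist


def tost_fisher_z(r, n, rho_null=0.0, delta=0.3, alpha=0.05):
    """Hypothetical helper: correlation TOST on the Fisher z scale."""
    # Transform the estimate and both bounds; rho_null ± delta must lie inside (-1, 1).
    z = atanh(r)
    z_low = atanh(rho_null - delta)
    z_high = atanh(rho_null + delta)
    se = 1.0 / sqrt(n - 3)  # Assumption: Pearson large-sample standard error
    norm = NormalDist()
    p_lower = 1.0 - norm.cdf((z - z_low) / se)  # H01: rho <= rho_null - delta
    p_upper = norm.cdf((z - z_high) / se)       # H02: rho >= rho_null + delta
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost < alpha


# The same small correlation is conclusive with n=200 but not with n=30.
print(tost_fisher_z(r=0.05, n=200))
print(tost_fisher_z(r=0.05, n=30))
```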
---
## Proportion TOST
### `tost_prop_one`
One-proportion TOST equivalence test.
Tests whether an observed proportion is practically equivalent to a target proportion.
```python
ps.tost_prop_one(
    successes: int,
    n: int,
    p0: float = 0.5,
    bounds_type: str = "symmetric",
    delta: float = 0.1,
    lower: float = -0.1,
    upper: float = 0.1,
    alpha: float = 0.05,
) -> pl.Expr
```
Null hypothesis: The true proportion differs from p0 by more than delta.
When to use:
- Validating that a process meets a target rate within tolerance
- Quality control: defect rate equivalent to acceptable level
- Survey validation: response rate equivalent to expected
Example:
# Test if success rate is within ±10 percentage points of 80%
ps.tost_prop_one(successes=82, n=100, p0=0.8, delta=0.1)
### `tost_prop_two`
Two-proportion TOST equivalence test.
Tests whether the difference between two proportions is practically zero.
ps.tost_prop_two(
successes1: int,
n1: int,
successes2: int,
n2: int,
bounds_type: str = "symmetric",
delta: float = 0.1,
lower: float = -0.1,
upper: float = 0.1,
alpha: float = 0.05,
) -> pl.Expr
Null hypothesis: The difference between proportions exceeds the equivalence bounds.
When to use:
- A/B testing: showing no meaningful difference between variants
- Comparing success rates across groups
- Non-inferiority/equivalence trials with binary endpoints
Example:
# Test if conversion rates are equivalent (within ±5 percentage points)
ps.tost_prop_two(successes1=120, n1=1000,
successes2=115, n2=1000, delta=0.05)
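For intuition, the same counts can be worked by hand with a z approximation. The sketch assumes an unpooled (Wald) standard error, which may differ from what the library uses; `tost_two_proportions` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def tost_two_proportions(s1, n1, s2, n2, delta=0.1, alpha=0.05):
    """Hypothetical helper: two-proportion TOST with an unpooled Wald standard error."""
    p1, p2 = s1 / n1, s2 / n2
    diff = p1 - p2
    # Assumption: unpooled standard error of the difference in proportions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    norm = NormalDist()
    p_lower = 1.0 - norm.cdf((diff + delta) / se)  # H01: diff <= -delta
    p_upper = norm.cdf((diff - delta) / se)        # H02: diff >= +delta
    p_tost = max(p_lower, p_upper)
    return {"estimate": diff, "tost_p_value": p_tost, "equivalent": p_tost < alpha}


result = tost_two_proportions(120, 1000, 115, 1000, delta=0.05)
print(result)
```

With these counts the observed difference is 0.5 percentage points and both one-sided nulls are rejected at alpha = 0.05 under this approximation.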
---
## Non-Parametric TOST
### `tost_wilcoxon_paired`
Wilcoxon signed-rank TOST equivalence test for paired samples (non-parametric).
Tests equivalence without assuming normality. Based on the Hodges-Lehmann pseudo-median of the paired differences.
```python
ps.tost_wilcoxon_paired(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    bounds_type: str = "symmetric",
    delta: float = 0.5,
    lower: float = -0.5,
    upper: float = 0.5,
    alpha: float = 0.05,
) -> pl.Expr
```
Null hypothesis: The pseudo-median of differences exceeds the equivalence bounds.
When to use:
- Paired data with non-normal differences
- Ordinal data or ranks
- Robust alternative to tost_t_test_paired
Note: Bounds refer to the Hodges-Lehmann pseudo-median, not the mean.
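The Hodges-Lehmann pseudo-median is the median of all pairwise (Walsh) averages of the differences. A minimal sketch, with `pseudo_median` as a hypothetical helper and made-up data containing one outlier:

```python
from statistics import mean, median


def pseudo_median(diffs):
    """Hypothetical helper: Hodges-Lehmann estimator (median of Walsh averages)."""
    n = len(diffs)
    # Walsh averages: (d_i + d_j) / 2 for every pair with i <= j
    walsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


diffs = [0.1, -0.2, 0.3, 0.0, 4.0]  # paired differences, one outlier
print("mean:", mean(diffs))
print("pseudo-median:", pseudo_median(diffs))
```

The outlier pulls the mean to 0.84 while the pseudo-median stays at 0.15, so the same delta can lead to different conclusions for the paired t-test and Wilcoxon variants.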
### `tost_wilcoxon_two_sample`
Wilcoxon rank-sum TOST equivalence test for two independent samples (non-parametric).
Tests equivalence using ranks, robust to non-normality and outliers.
ps.tost_wilcoxon_two_sample(
x: Union[pl.Expr, str],
y: Union[pl.Expr, str],
bounds_type: str = "symmetric",
delta: float = 0.5,
lower: float = -0.5,
upper: float = 0.5,
alpha: float = 0.05,
) -> pl.Expr
Null hypothesis: The location shift between groups exceeds the equivalence bounds.
When to use:
- Non-normal data with outliers
- Ordinal response variables
- Robust alternative to tost_t_test_two_sample
## Robust TOST
### `tost_bootstrap`
Bootstrap TOST equivalence test using resampling for inference.
Distribution-free method that makes no parametric assumptions. Uses bootstrap confidence intervals for equivalence testing.
ps.tost_bootstrap(
x: Union[pl.Expr, str],
y: Union[pl.Expr, str],
bounds_type: str = "symmetric",
delta: float = 0.5,
lower: float = -0.5,
upper: float = 0.5,
alpha: float = 0.05,
n_bootstrap: int = 1000,
seed: int | None = None,
) -> pl.Expr
Null hypothesis: The mean difference exceeds the equivalence bounds.
Parameters:
- n_bootstrap: Number of bootstrap resamples (default 1000; use 10000 for publication)
- seed: For reproducible results
When to use:
- Unknown or complex distribution shapes
- When parametric assumptions are questionable
- Small samples where asymptotic approximations fail
Note: Computationally more expensive than parametric tests.
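The resampling idea can be sketched with a percentile interval. The percentile method and the 100(1 - 2·alpha)% level are assumptions made for illustration; the library may build its interval differently. `bootstrap_tost` is a hypothetical helper.

```python
import random
from statistics import mean


def bootstrap_tost(x, y, lower, upper, alpha=0.05, n_bootstrap=1000, seed=None):
    """Hypothetical helper: percentile-bootstrap equivalence check for mean(x) - mean(y)."""
    rng = random.Random(seed)  # a fixed seed makes the resampling reproducible
    diffs = []
    for _ in range(n_bootstrap):
        xs = rng.choices(x, k=len(x))  # resample each group with replacement
        ys = rng.choices(y, k=len(y))
        diffs.append(mean(xs) - mean(ys))
    diffs.sort()
    # Assumption: percentile interval at level 1 - 2*alpha (90% for alpha = 0.05)
    lo_idx = min(n_bootstrap - 1, max(0, round(alpha * n_bootstrap)))
    hi_idx = min(n_bootstrap - 1, max(0, round((1 - alpha) * n_bootstrap) - 1))
    ci_lower, ci_upper = diffs[lo_idx], diffs[hi_idx]
    # Equivalence is claimed only if the whole interval lies inside the bounds
    return ci_lower, ci_upper, (lower < ci_lower and ci_upper < upper)


x = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9]
y = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0]
print(bootstrap_tost(x, y, lower=-0.5, upper=0.5, seed=42))
```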
### `tost_yuen`
Yuen TOST equivalence test comparing trimmed means (robust to outliers).
Uses trimmed means and Winsorized variances for robust equivalence testing.
ps.tost_yuen(
x: Union[pl.Expr, str],
y: Union[pl.Expr, str],
trim: float = 0.2,
bounds_type: str = "symmetric",
delta: float = 0.5,
lower: float = -0.5,
upper: float = 0.5,
alpha: float = 0.05,
) -> pl.Expr
Null hypothesis: The difference in trimmed means exceeds the equivalence bounds.
Parameters:
- trim=0.2 (default): Remove 20% from each tail, using middle 60%
When to use:
- Data contains outliers
- Heavy-tailed distributions
- Robust inference on central tendency
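The two ingredients named above, the trimmed mean and the Winsorized variance, can be computed as in this sketch. `trimmed_mean_and_winsorized_var` is a hypothetical helper and the data are made up; it shows the standard definitions rather than the library's internals.

```python
from math import floor
from statistics import mean, variance


def trimmed_mean_and_winsorized_var(values, trim=0.2):
    """Hypothetical helper: trimmed mean and Winsorized sample variance."""
    xs = sorted(values)
    n = len(xs)
    g = floor(trim * n)  # observations cut from each tail
    trimmed = xs[g:n - g]
    # Winsorizing replaces each cut value with the nearest retained value
    winsorized = [xs[g]] * g + trimmed + [xs[n - g - 1]] * g
    return mean(trimmed), variance(winsorized)


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]
t_mean, w_var = trimmed_mean_and_winsorized_var(data, trim=0.2)
print("raw mean:", mean(data), "trimmed mean:", t_mean)
print("raw variance:", variance(data), "Winsorized variance:", w_var)
```

With ten observations and trim=0.2, two values are dropped from each tail, so the single outlier (25.0) no longer affects the location estimate.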
## Choosing a TOST Test
| Situation | Recommended Test |
|---|---|
| Normal data, one sample | tost_t_test_one_sample |
| Normal data, two independent groups | tost_t_test_two_sample |
| Normal data, paired measurements | tost_t_test_paired |
| Non-normal, two groups | tost_wilcoxon_two_sample |
| Non-normal, paired | tost_wilcoxon_paired |
| Outliers present | tost_yuen |
| Unknown distribution | tost_bootstrap |
| Correlation equivalence | tost_correlation |
| Proportion equivalence | tost_prop_one, tost_prop_two |
## Setting Equivalence Bounds
Critical decision: The choice of δ should be made before seeing the data, based on:
- Clinical/practical significance thresholds
- Regulatory requirements (e.g., bioequivalence uses 80-125% for AUC)
- Subject matter expertise

Common conventions:

| Domain | Typical δ |
|---|---|
| Cohen's d (standardized) | 0.3-0.5 (small to medium effect) |
| Correlation | 0.3 (weak correlation) |
| Proportions | 0.05-0.10 (5-10 percentage points) |
| Bioequivalence | 20% (0.80-1.25 ratio) |
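The 0.80-1.25 ratio limits look asymmetric but are symmetric on the log scale, which is why they are usually applied to log-transformed data. A short check, using only the standard library:

```python
from math import isclose, log

lower_ratio, upper_ratio = 0.80, 1.25

# On the log scale the two limits are mirror images around zero
log_lower, log_upper = log(lower_ratio), log(upper_ratio)
print(f"log bounds: {log_lower:.4f} to {log_upper:.4f}")
print("symmetric on log scale:", isclose(-log_lower, log_upper))
```

On log-transformed measurements this corresponds to a symmetric margin of about 0.223. On the original scale the limits would instead need explicit lower/upper values with bounds_type="raw".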
## Interpretation
- equivalent = True: The effect is within the equivalence bounds at the specified alpha level. Equivalence established.
- equivalent = False: Cannot conclude equivalence. Either:
  - The effect is truly outside the bounds, OR
  - Sample size insufficient to establish equivalence (low power)
- tost_p_value: The maximum of the two one-sided p-values; reject non-equivalence if < alpha
Important: A non-significant TOST result does NOT mean the groups differ—it means we cannot conclude equivalence with the current data.
## See Also
- Parametric Tests - Traditional t-tests (for testing difference)
- Non-Parametric Tests - Traditional non-parametric tests
- Correlation Tests - Correlation methods