Distributional Tests¶

Tests for checking distributional properties of data, particularly normality. Use these tests to verify assumptions before applying parametric methods.

Validation: All distributional tests are validated against R implementations (shapiro.test, moments package).

`shapiro_wilk`¶

Shapiro-Wilk test for normality.

Generally among the most powerful tests for detecting departures from normality in small to medium samples (n < 5000). The statistic measures how closely the ordered data follow the corresponding expected normal quantiles, much like a squared correlation.

ps.shapiro_wilk(
    x: Union[pl.Expr, str],
) -> pl.Expr

Returns: Struct{statistic: Float64, p_value: Float64}

Null hypothesis: The data comes from a normally distributed population.

Statistic (W): Ranges from 0 to 1. Values close to 1 are consistent with normality; values well below 1 indicate non-normality.
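To make the correlation idea concrete, here is a minimal standard-library sketch of the closely related Shapiro-Francia statistic W', the squared correlation between the sorted data and approximate normal scores. It is not the algorithm behind `ps.shapiro_wilk`: the function name `shapiro_francia_w` and the use of Blom's approximation for the normal scores are assumptions made for illustration, and no p-value is computed.

```python
from statistics import NormalDist, mean


def shapiro_francia_w(data):
    """Squared correlation between sorted data and approximate normal scores.

    Illustrative only: this is the Shapiro-Francia W', a simpler relative of
    the Shapiro-Wilk W, not the library's implementation.
    """
    x = sorted(data)
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 observations")

    std_normal = NormalDist()
    # Blom's approximation to the expected standard normal order statistics.
    scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

    x_bar, s_bar = mean(x), mean(scores)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sss = sum((si - s_bar) ** 2 for si in scores)
    if sxx == 0:
        raise ValueError("data are constant")
    sxs = sum((xi - x_bar) * (si - s_bar) for xi, si in zip(x, scores))
    return sxs ** 2 / (sxx * sss)


# A strongly right-skewed sample gives a value well below 1.
print(shapiro_francia_w([1, 1, 1, 1, 2, 2, 3, 5, 9, 30]))
```

Data that line up with the normal scores give a value near 1, while skewed or heavy-tailed data pull it down, which mirrors how W is read above.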

Sample size: Best for n = 3 to 5000. For larger samples, even trivial departures from normality become significant.

When to use:

- Checking normality assumptions before t-tests, ANOVA, or linear regression
- Small to medium samples where power matters
- Exploratory data analysis

Limitations:

- Highly sensitive with large samples (may reject normality for practically normal data)
- Does not identify how the data deviates from normality (skewness vs kurtosis)

Example:

# Check if residuals are normal
df.select(ps.shapiro_wilk("residuals").alias("normality_test"))

# Per-group normality check
df.group_by("treatment").agg(
    ps.shapiro_wilk("response").alias("normality")
)

Interpretation:

| p-value | Conclusion |
|---------|------------|
| > 0.05 | Cannot reject normality; data is consistent with normal distribution |
| ≤ 0.05 | Significant deviation from normality; consider non-parametric methods |
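The decision rule in the interpretation table can be sketched as a small helper. The function name `interpret_normality` and the `alpha` parameter are hypothetical and not part of the library; 0.05 is simply the threshold used in the table.

```python
def interpret_normality(p_value, alpha=0.05):
    """Map a normality-test p-value to the conclusion from the table.

    Hypothetical helper for illustration; `alpha` defaults to the 0.05
    threshold used in the interpretation table.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value > alpha:
        return "cannot reject normality"
    return "significant deviation from normality"


print(interpret_normality(0.20))
print(interpret_normality(0.01))
```

A p-value above the threshold does not prove normality; it only means the test found no evidence against it.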

Note: Always combine with visual inspection (Q-Q plots, histograms) as p-values alone don't tell the full story.

`dagostino`¶

D'Agostino-Pearson omnibus test for normality.

Tests whether a sample's skewness and kurtosis match those of a normal distribution. Combines separate tests for skewness and kurtosis into a single omnibus test of normality.

ps.dagostino(
    x: Union[pl.Expr, str],
) -> pl.Expr

Returns: Struct{statistic: Float64, p_value: Float64}

Null hypothesis: The data has skewness and kurtosis consistent with a normal distribution.

Statistic (K²): Approximately chi-square distributed with 2 degrees of freedom under the null hypothesis.
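Because the reference distribution has 2 degrees of freedom, the upper-tail probability has a closed form: for a chi-square variable with 2 degrees of freedom, P(X > k) = exp(-k / 2). The sketch below uses that identity to turn a K² value into a p-value. The function name `k2_p_value` is hypothetical, and this only shows the arithmetic of the chi-square approximation, not how the library computes its result.

```python
import math


def k2_p_value(k2):
    """Upper-tail probability of a chi-square(2) variable at `k2`.

    Illustrative helper: uses the closed form exp(-k2 / 2), which holds
    only for 2 degrees of freedom.
    """
    if k2 < 0:
        raise ValueError("K² cannot be negative")
    return math.exp(-k2 / 2.0)


# K² = 2 * ln(20), about 5.99, is the 5% critical value.
print(k2_p_value(2 * math.log(20)))
```

Larger K² values give smaller p-values, so a sample is flagged as non-normal at the 5% level once K² exceeds roughly 5.99.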

Sample size: Requires at least 20 observations. Works well for larger samples where Shapiro-Wilk becomes overly sensitive.

What it detects:

- Skewness: Asymmetry in the distribution (left or right tails)
- Kurtosis: Heavy or light tails compared to normal (leptokurtic vs platykurtic)
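The two quantities can be computed directly from central moments. The sketch below uses the simple moment-based (biased) estimators, which is an assumption: the library may use bias-corrected versions, and the function name `skewness_and_excess_kurtosis` is hypothetical. A normal distribution has skewness 0 and excess kurtosis 0.

```python
def skewness_and_excess_kurtosis(data):
    """Return (skewness, excess kurtosis) from biased central moments.

    Illustrative only: g1 = m3 / m2**1.5 and g2 = m4 / m2**2 - 3, where
    m_k is the k-th central moment divided by n.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least 2 observations")
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    if m2 == 0:
        raise ValueError("data are constant")
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# A long right tail produces positive skewness.
print(skewness_and_excess_kurtosis([1, 1, 1, 1, 10]))
```

Positive skewness indicates a longer right tail and negative skewness a longer left tail; positive excess kurtosis indicates heavier tails than normal (leptokurtic) and negative excess kurtosis lighter tails (platykurtic).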

When to use:

- Larger samples (n > 50) where Shapiro-Wilk may be too sensitive
- When you want to detect departures from both symmetry and tail behavior
- Automated normality checking in pipelines

Example:

# Test normality for larger dataset
df.select(ps.dagostino("measurements"))

# Compare normality across conditions
df.group_by("condition").agg(
    ps.dagostino("outcome").alias("normality")
)

Choosing a Normality Test¶

Situation	Recommended Test
Small samples (n < 50)	`shapiro_wilk`
Medium samples (50 ≤ n < 300)	Either test
Large samples (n ≥ 300)	`dagostino` (or rely on CLT)
Need to identify skewness/kurtosis	`dagostino`
Most powerful test	`shapiro_wilk`
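For automated pipelines, the sample-size rows of the table can be encoded as a small lookup. This is a sketch of the guidance above, not a library feature: the function name `recommend_normality_test` and its return strings are hypothetical, and the cut-offs are the ones given in the table.

```python
def recommend_normality_test(n):
    """Suggest a normality test from the sample size, following the table.

    Hypothetical helper: returns "shapiro_wilk", "either", or "dagostino".
    """
    if n < 3:
        raise ValueError("normality tests need at least 3 observations")
    if n < 50:
        return "shapiro_wilk"  # small samples
    if n < 300:
        return "either"        # medium samples
    return "dagostino"         # large samples (or rely on the CLT)


for size in (12, 120, 5000):
    print(size, recommend_normality_test(size))
```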

Practical advice:

- With large samples, minor non-normality is often detected but may not matter in practice
- As a rule of thumb, the Central Limit Theorem makes inference on means reasonably robust once n > 30
- Visual inspection (Q-Q plots) is often more informative than p-values
- Consider the consequences: t-tests and ANOVA are fairly robust to mild non-normality
