Correlation Tests¶
Tests for measuring and testing linear, monotonic, and general associations between variables.
Validation: All correlation tests are validated against R implementations (cor.test, ppcor, energy packages).
pearson¶
Pearson product-moment correlation coefficient with hypothesis test.
Measures the strength and direction of the linear relationship between two continuous variables. Ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship.
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Null hypothesis: True correlation is zero (no linear association).
Assumptions:

- Continuous variables
- Linear relationship
- Bivariate normal distribution (for inference)
- No extreme outliers
Interpretation:

| \|r\| | Strength |
|---------|-------------|
| 0.0-0.1 | Negligible |
| 0.1-0.3 | Weak |
| 0.3-0.5 | Moderate |
| 0.5-0.7 | Strong |
| 0.7-1.0 | Very strong |
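Under the null hypothesis, the test statistic is the usual t = r·√(n−2)/√(1−r²) with n−2 degrees of freedom. A plain-SciPy sketch (independent of this library) verifying that formula against `scipy.stats.pearsonr`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)
n = len(x)

r, p_ref = stats.pearsonr(x, y)

# t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(round(p, 6), round(p_ref, 6))  # the two p-values agree
```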
Example:
```python
# Basic correlation
df.select(ps.pearson("x", "y").alias("cor"))

# Per-group correlation
df.group_by("category").agg(
    ps.pearson("sales", "advertising").alias("correlation")
)
```
spearman¶
Spearman rank correlation coefficient with hypothesis test.
Measures the strength and direction of the monotonic relationship between two variables using ranks. More robust than Pearson to outliers, and it captures non-linear monotonic relationships.
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Null hypothesis: No monotonic association between variables.
When to use:

- Ordinal data
- Non-linear but monotonic relationships
- Data with outliers
- When the normality assumption is violated
Example:
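Assuming the same call pattern as `pearson` (e.g. `df.select(ps.spearman("x", "y"))`), the rank-based behaviour can be illustrated with plain SciPy: Spearman scores a monotonic but nonlinear relationship as perfect, while Pearson does not:

```python
import numpy as np
from scipy import stats

x = np.linspace(1, 10, 50)
y = x**3  # monotonic but strongly nonlinear

rho, _ = stats.spearmanr(x, y)  # rank-based: exactly 1.0
r, _ = stats.pearsonr(x, y)     # linear measure: noticeably below 1

print(rho, round(r, 3))
```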
kendall¶
Kendall's tau correlation coefficient with hypothesis test.
A non-parametric measure of association based on concordant and discordant pairs. More robust than Spearman for small samples and handles ties well.
```python
ps.kendall(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    variant: str = "b",  # "a", "b", or "c"
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Variants:
- "a" (tau-a): Does not adjust for ties; use only when no ties exist
- "b" (tau-b): Adjusts for ties; most commonly used (default)
- "c" (tau-c): Stuart's tau-c for rectangular contingency tables
When to use:

- Small sample sizes
- Many tied values
- Ordinal data
- When you need a more interpretable measure (proportion of concordant vs discordant pairs)
Example:
```python
# Kendall's tau-b (default)
df.select(ps.kendall("rank_a", "rank_b"))

# For data with no ties
df.select(ps.kendall("x", "y", variant="a"))
```
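The concordant/discordant reading of tau can be checked by hand. A NumPy sketch computing tau-a directly from pair counts; with no ties it coincides with SciPy's tau-b:

```python
import numpy as np
from itertools import combinations
from scipy import stats

x = np.array([1.0, 2.5, 3.1, 4.7, 5.2, 6.9])
y = np.array([2.0, 1.1, 4.3, 3.9, 6.5, 7.0])  # no tied values

# Count concordant and discordant pairs
concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = np.sign(x[j] - x[i]) * np.sign(y[j] - y[i])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

# tau-a = (C - D) / (number of pairs)
n_pairs = len(x) * (len(x) - 1) // 2
tau_a = (concordant - discordant) / n_pairs

tau_b, _ = stats.kendalltau(x, y)  # equals tau-a when there are no ties
print(tau_a, tau_b)
```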
distance_cor¶
Distance correlation with permutation test.
Measures both linear and nonlinear associations between variables. Unlike Pearson and Spearman, distance correlation equals zero if and only if the variables are statistically independent.
```python
ps.distance_cor(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    n_permutations: int = 999,
    seed: int | None = None,
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Key property: Distance correlation = 0 ⟺ X and Y are independent.
This is a fundamental advantage over Pearson/Spearman, which can be zero even when variables are dependent (e.g., quadratic relationships).
When to use:

- Detecting any type of association (linear, nonlinear, complex)
- When you suspect nonlinear relationships
- For exploratory analysis when relationship form is unknown
Note: Computationally more expensive than Pearson/Spearman. The permutation test provides the p-value.
Example:
```python
# Detect any association
df.select(ps.distance_cor("x", "y", n_permutations=999))

# Reproducible result
df.select(ps.distance_cor("x", "y", seed=42))
```
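The independence property is visible on a quadratic relation, where Pearson is blind but distance correlation is not. A NumPy sketch of the sample statistic via double-centred distance matrices, as an assumption-level illustration rather than this library's implementation:

```python
import numpy as np
from scipy import stats

def dcor(x, y):
    # Pairwise distance matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

x = np.linspace(-1, 1, 201)
y = x**2                     # dependent, but not linearly

r, _ = stats.pearsonr(x, y)  # ~0: Pearson misses the dependence
print(round(r, 3), round(dcor(x, y), 3))
```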
partial_cor¶
Partial correlation controlling for one or more covariates.
Measures the association between two variables after removing the linear effects of other variables. Useful for understanding direct relationships while controlling for confounders.
```python
ps.partial_cor(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    covariates: list[Union[pl.Expr, str]],
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Interpretation: The correlation between X and Y that remains after accounting for the linear influence of the covariates on both X and Y.
When to use:

- Controlling for confounding variables
- Understanding direct vs indirect relationships
- Causal inference (with appropriate design)
Example:
```python
# Correlation between x and y, controlling for z1 and z2
df.select(ps.partial_cor("x", "y", ["z1", "z2"]))

# Control for age when examining income-happiness relationship
df.select(ps.partial_cor("income", "happiness", ["age", "education"]))
```
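Equivalently, the partial correlation is the plain correlation of the residuals after regressing both X and Y on the covariates. A NumPy sketch, where a shared confounder z induces a strong raw correlation that vanishes once z is partialled out:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500
z = rng.normal(size=n)               # confounder
x = z + 0.5 * rng.normal(size=n)     # x and y share only z
y = z + 0.5 * rng.normal(size=n)

def residualize(v, covariates):
    # Least-squares residuals of v regressed on [1, covariates]
    Z = np.column_stack([np.ones(len(v))] + covariates)
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ beta

r_raw, _ = stats.pearsonr(x, y)
r_partial, _ = stats.pearsonr(residualize(x, [z]), residualize(y, [z]))
print(round(r_raw, 2), round(r_partial, 2))
```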
semi_partial_cor¶
Semi-partial (part) correlation.
Removes the covariates' linear effects from X only, leaving Y uncontrolled. Measures the unique contribution of X to Y after the other predictors have been accounted for.
```python
ps.semi_partial_cor(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    covariates: list[Union[pl.Expr, str]],
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Difference from partial correlation:

- Partial: Controls for covariates in both X and Y
- Semi-partial: Controls for covariates in X only
When to use:

- Assessing unique contribution of a predictor in regression
- The squared semi-partial correlation equals the increase in R² when adding X to a model with the covariates
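The ΔR² identity can be verified directly: under the standard definition, the semi-partial correlation is the correlation of Y with the residual of X on the covariates, and its square equals the in-sample gain in R². A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=n)                   # covariate
x = 0.6 * z + rng.normal(size=n)         # predictor of interest
y = 1.0 + x + z + rng.normal(size=n)

def r2(y, X):
    # In-sample R-squared of an OLS fit with intercept
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Residual of x after removing the covariate's linear effect
Zc = np.column_stack([np.ones(n), z])
x_resid = x - Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]

sr = np.corrcoef(y, x_resid)[0, 1]       # semi-partial correlation
delta_r2 = r2(y, np.column_stack([z, x])) - r2(y, z[:, None])
print(round(sr**2, 4), round(delta_r2, 4))
```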
icc¶
Intraclass Correlation Coefficient (ICC) for reliability and agreement.
Measures the consistency or agreement of measurements made by different raters or at different times. Essential for assessing inter-rater reliability and measurement consistency.
```python
ps.icc(
    values: Union[pl.Expr, str],
    icc_type: str = "icc1",  # "icc1", "icc2", "icc3", "icc2k", "icc3k"
    conf_level: float = 0.95,
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
ICC Types:
| Type | Model | Definition | Use Case |
|---|---|---|---|
| `icc1` | One-way random | Absolute agreement | Different raters for each subject |
| `icc2` | Two-way random | Absolute agreement | Same raters, raters are random sample |
| `icc3` | Two-way mixed | Consistency | Same raters, raters are fixed |
| `icc2k` | Two-way random | Mean of k raters | Reliability of average ratings |
| `icc3k` | Two-way mixed | Mean of k raters | Reliability of average ratings |
Interpretation:

| ICC | Reliability |
|-----|-------------|
| < 0.50 | Poor |
| 0.50-0.75 | Moderate |
| 0.75-0.90 | Good |
| > 0.90 | Excellent |
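For intuition, ICC(1) follows from a one-way ANOVA decomposition: ICC1 = (MSB − MSW) / (MSB + (k − 1)·MSW) for k ratings per subject. A NumPy sketch on simulated ratings with a strong subject effect (hypothetical data, not this library's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)
n_subjects, k = 30, 4
# Between-subject sd = 2.0, within-subject (rater) sd = 0.5
subject_effect = rng.normal(scale=2.0, size=(n_subjects, 1))
ratings = subject_effect + rng.normal(scale=0.5, size=(n_subjects, k))

grand = ratings.mean()
subj_means = ratings.mean(axis=1)

# One-way ANOVA mean squares
msb = k * ((subj_means - grand) ** 2).sum() / (n_subjects - 1)
msw = ((ratings - subj_means[:, None]) ** 2).sum() / (n_subjects * (k - 1))

icc1 = (msb - msw) / (msb + (k - 1) * msw)
print(round(icc1, 3))  # high by construction (true value ~ 4 / 4.25)
```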
Choosing a Correlation Measure¶
| Situation | Recommended |
|---|---|
| Linear relationship, continuous data | pearson |
| Monotonic relationship, ordinal data | spearman |
| Small samples, many ties | kendall |
| Unknown relationship type | distance_cor |
| Control for confounders | partial_cor |
| Unique contribution in regression | semi_partial_cor |
| Inter-rater reliability | icc |
See Also¶
- TOST Equivalence Tests - `tost_correlation` for testing whether a correlation is practically zero
- Parametric Tests - Tests for comparing means