Correlation Tests¶
Tests for measuring and testing linear, monotonic, and general associations between variables.
Validation: All correlation tests are validated against R implementations (cor.test, ppcor, energy packages).
pearson¶
Pearson product-moment correlation coefficient with hypothesis test.
Measures the strength and direction of the linear relationship between two continuous variables. Ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship.
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Null hypothesis: True correlation is zero (no linear association).
Assumptions:

- Continuous variables
- Linear relationship
- Bivariate normal distribution (for inference)
- No extreme outliers
Interpretation:

| \|r\| | Strength |
|---------|-------------|
| 0.0-0.1 | Negligible |
| 0.1-0.3 | Weak |
| 0.3-0.5 | Moderate |
| 0.5-0.7 | Strong |
| 0.7-1.0 | Very strong |
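Under the null hypothesis, the test statistic is the usual t = r·√(n−2)/√(1−r²) with n−2 degrees of freedom. A plain-SciPy sketch (independent of this library) verifying that formula against `scipy.stats.pearsonr`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)
n = len(x)

r, p_ref = stats.pearsonr(x, y)

# t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(round(p, 6), round(p_ref, 6))  # the two p-values agree
```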
Example:
```python
# Basic correlation
df.select(ps.pearson("x", "y").alias("cor"))

# Per-group correlation
df.group_by("category").agg(
    ps.pearson("sales", "advertising").alias("correlation")
)
```
spearman¶
Spearman rank correlation coefficient with hypothesis test.
Measures the strength and direction of the monotonic relationship between two variables using ranks. More robust than Pearson to outliers, and it captures non-linear monotonic relationships.
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Null hypothesis: No monotonic association between variables.
When to use:

- Ordinal data
- Non-linear but monotonic relationships
- Data with outliers
- When the normality assumption is violated
Example:
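Assuming the same call pattern as `pearson` (e.g. `df.select(ps.spearman("x", "y"))`), the rank-based behaviour can be illustrated with plain SciPy: Spearman scores a monotonic but nonlinear relationship as perfect, while Pearson does not:

```python
import numpy as np
from scipy import stats

x = np.linspace(1, 10, 50)
y = x**3  # monotonic but strongly nonlinear

rho, _ = stats.spearmanr(x, y)  # rank-based: exactly 1.0
r, _ = stats.pearsonr(x, y)     # linear measure: noticeably below 1

print(rho, round(r, 3))
```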
kendall¶
Kendall's tau correlation coefficient with hypothesis test.
A non-parametric measure of association based on concordant and discordant pairs. More robust than Spearman for small samples and handles ties well.
```python
ps.kendall(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    variant: str = "b",  # "a", "b", or "c"
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Variants:
- "a" (tau-a): Does not adjust for ties; use only when no ties exist
- "b" (tau-b): Adjusts for ties; most commonly used (default)
- "c" (tau-c): Stuart's tau-c for rectangular contingency tables
When to use:

- Small sample sizes
- Many tied values
- Ordinal data
- When you need a more interpretable measure (proportion of concordant vs discordant pairs)
Example:
```python
# Kendall's tau-b (default)
df.select(ps.kendall("rank_a", "rank_b"))

# For data with no ties
df.select(ps.kendall("x", "y", variant="a"))
```
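The concordant/discordant reading of tau can be checked by hand. A NumPy sketch computing tau-a directly from pair counts; with no ties it coincides with SciPy's tau-b:

```python
import numpy as np
from itertools import combinations
from scipy import stats

x = np.array([1.0, 2.5, 3.1, 4.7, 5.2, 6.9])
y = np.array([2.0, 1.1, 4.3, 3.9, 6.5, 7.0])  # no tied values

# Count concordant and discordant pairs
concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    s = np.sign(x[j] - x[i]) * np.sign(y[j] - y[i])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

# tau-a = (C - D) / (number of pairs)
n_pairs = len(x) * (len(x) - 1) // 2
tau_a = (concordant - discordant) / n_pairs

tau_b, _ = stats.kendalltau(x, y)  # equals tau-a when there are no ties
print(tau_a, tau_b)
```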
distance_cor¶
Distance correlation with permutation test.
Measures both linear and nonlinear associations between variables. Unlike Pearson and Spearman, distance correlation equals zero if and only if the variables are statistically independent.
```python
ps.distance_cor(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    n_permutations: int = 999,
    seed: int | None = None,
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Key property: Distance correlation = 0 ⟺ X and Y are independent.
This is a fundamental advantage over Pearson/Spearman, which can be zero even when variables are dependent (e.g., quadratic relationships).
When to use:

- Detecting any type of association (linear, nonlinear, complex)
- When you suspect nonlinear relationships
- For exploratory analysis when relationship form is unknown
Note: Computationally more expensive than Pearson/Spearman. The permutation test provides the p-value.
Example:
```python
# Detect any association
df.select(ps.distance_cor("x", "y", n_permutations=999))

# Reproducible result
df.select(ps.distance_cor("x", "y", seed=42))
```
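The independence property is visible on a quadratic relation, where Pearson is blind but distance correlation is not. A NumPy sketch of the sample statistic via double-centred distance matrices, as an assumption-level illustration rather than this library's implementation:

```python
import numpy as np
from scipy import stats

def dcor(x, y):
    # Pairwise distance matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

x = np.linspace(-1, 1, 201)
y = x**2                     # dependent, but not linearly

r, _ = stats.pearsonr(x, y)  # ~0: Pearson misses the dependence
print(round(r, 3), round(dcor(x, y), 3))
```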
partial_cor¶
Partial correlation controlling for one or more covariates.
Measures the association between two variables after removing the linear effects of other variables. Useful for understanding direct relationships while controlling for confounders.
```python
ps.partial_cor(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    covariates: list[Union[pl.Expr, str]],
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Interpretation: The correlation between X and Y that remains after accounting for the linear influence of the covariates on both X and Y.
When to use:

- Controlling for confounding variables
- Understanding direct vs indirect relationships
- Causal inference (with appropriate design)
Example:
```python
# Correlation between x and y, controlling for z1 and z2
df.select(ps.partial_cor("x", "y", ["z1", "z2"]))

# Control for age when examining income-happiness relationship
df.select(ps.partial_cor("income", "happiness", ["age", "education"]))
```
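Equivalently, the partial correlation is the plain correlation of the residuals after regressing both X and Y on the covariates. A NumPy sketch, where a shared confounder z induces a strong raw correlation that vanishes once z is partialled out:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500
z = rng.normal(size=n)               # confounder
x = z + 0.5 * rng.normal(size=n)     # x and y share only z
y = z + 0.5 * rng.normal(size=n)

def residualize(v, covariates):
    # Least-squares residuals of v regressed on [1, covariates]
    Z = np.column_stack([np.ones(len(v))] + covariates)
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ beta

r_raw, _ = stats.pearsonr(x, y)
r_partial, _ = stats.pearsonr(residualize(x, [z]), residualize(y, [z]))
print(round(r_raw, 2), round(r_partial, 2))
```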
semi_partial_cor¶
Semi-partial (part) correlation.
Removes the covariates' linear effects from X only, leaving Y uncontrolled. Measures the unique contribution of X to Y after the other predictors have been accounted for.
```python
ps.semi_partial_cor(
    x: Union[pl.Expr, str],
    y: Union[pl.Expr, str],
    covariates: list[Union[pl.Expr, str]],
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
Difference from partial correlation:

- Partial: Controls for covariates in both X and Y
- Semi-partial: Controls for covariates in X only
When to use:

- Assessing unique contribution of a predictor in regression
- The squared semi-partial correlation equals the increase in R² when adding X to a model with the covariates
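The ΔR² identity can be verified directly: under the standard definition, the semi-partial correlation is the correlation of Y with the residual of X on the covariates, and its square equals the in-sample gain in R². A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=n)                   # covariate
x = 0.6 * z + rng.normal(size=n)         # predictor of interest
y = 1.0 + x + z + rng.normal(size=n)

def r2(y, X):
    # In-sample R-squared of an OLS fit with intercept
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Residual of x after removing the covariate's linear effect
Zc = np.column_stack([np.ones(n), z])
x_resid = x - Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]

sr = np.corrcoef(y, x_resid)[0, 1]       # semi-partial correlation
delta_r2 = r2(y, np.column_stack([z, x])) - r2(y, z[:, None])
print(round(sr**2, 4), round(delta_r2, 4))
```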
icc¶
Intraclass Correlation Coefficient (ICC) for reliability and agreement.
Measures the consistency or agreement of measurements made by different raters or at different times. Essential for assessing inter-rater reliability and measurement consistency.
```python
ps.icc(
    values: Union[pl.Expr, str],
    icc_type: str = "icc1",  # "icc1", "icc2", "icc3", "icc2k", "icc3k"
    conf_level: float = 0.95,
) -> pl.Expr
```
Returns: Struct{estimate: Float64, statistic: Float64, p_value: Float64, ci_lower: Float64, ci_upper: Float64, n: UInt32}
ICC Types:
| Type | Model | Definition | Use Case |
|---|---|---|---|
| `icc1` | One-way random | Absolute agreement | Different raters for each subject |
| `icc2` | Two-way random | Absolute agreement | Same raters, raters are random sample |
| `icc3` | Two-way mixed | Consistency | Same raters, raters are fixed |
| `icc2k` | Two-way random | Mean of k raters | Reliability of average ratings |
| `icc3k` | Two-way mixed | Mean of k raters | Reliability of average ratings |
Interpretation:

| ICC | Reliability |
|-----|-------------|
| < 0.50 | Poor |
| 0.50-0.75 | Moderate |
| 0.75-0.90 | Good |
| > 0.90 | Excellent |
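For intuition, ICC(1) follows from a one-way ANOVA decomposition: ICC1 = (MSB − MSW) / (MSB + (k − 1)·MSW) for k ratings per subject. A NumPy sketch on simulated ratings with a strong subject effect (hypothetical data, not this library's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)
n_subjects, k = 30, 4
# Between-subject sd = 2.0, within-subject (rater) sd = 0.5
subject_effect = rng.normal(scale=2.0, size=(n_subjects, 1))
ratings = subject_effect + rng.normal(scale=0.5, size=(n_subjects, k))

grand = ratings.mean()
subj_means = ratings.mean(axis=1)

# One-way ANOVA mean squares
msb = k * ((subj_means - grand) ** 2).sum() / (n_subjects - 1)
msw = ((ratings - subj_means[:, None]) ** 2).sum() / (n_subjects * (k - 1))

icc1 = (msb - msw) / (msb + (k - 1) * msw)
print(round(icc1, 3))  # high by construction (true value ~ 4 / 4.25)
```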
Choosing a Correlation Measure¶
| Situation | Recommended |
|---|---|
| Linear relationship, continuous data | pearson |
| Monotonic relationship, ordinal data | spearman |
| Small samples, many ties | kendall |
| Unknown relationship type | distance_cor |
| Control for confounders | partial_cor |
| Unique contribution in regression | semi_partial_cor |
| Inter-rater reliability | icc |
See Also¶
- TOST Equivalence Tests - `tost_correlation` for testing whether a correlation is practically zero
- Parametric Tests - Tests for comparing means