Advanced Correlation & Reliability¶
When Pearson's r misses the story — nonlinear associations, confounded relationships, and normality assumptions — you need specialized correlation tools. This page covers distance correlation for nonlinear patterns, semi-partial correlation for unique variance decomposition, and D'Agostino's normality test as a diagnostic complement.
Setup¶
Nonlinear Association — Distance Correlation¶
Pearson's r only captures linear relationships. Distance correlation detects any form of statistical dependence — including parabolic, sinusoidal, and other nonlinear patterns.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 50).tolist()
y = [xi**2 + float(rng.normal(0, 0.5)) for xi in x]
df_nl = pl.DataFrame({"x": x, "y": y})
First, check what Pearson's r says:
pearson_result = df_nl.select(
ps.pearson("x", "y").alias("pearson")
)
p = pearson_result["pearson"][0]
print(f"Pearson: r={p['estimate']:.4f}, p={p['p_value']:.4f}")
Expected output:
Pearson sees almost no relationship (r=0.14, not significant). But the data is a noisy parabola — there is clearly a strong pattern.
Now try distance correlation:
dcor_result = df_nl.select(
ps.distance_cor("x", "y", n_permutations=999, seed=42).alias("dcor")
)
dc = dcor_result["dcor"][0]
print(f"Distance cor: r={dc['estimate']:.4f}, p={dc['p_value']:.3f}")
Expected output:
Distance correlation detects the nonlinear relationship (r=0.49, p=0.001) that Pearson completely missed.

Plot code
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(df_nl["x"].to_list(), df_nl["y"].to_list(),
s=50, color="#4C72B0", edgecolor="white", alpha=0.8)
# Overlay true parabola
x_line = np.linspace(-3, 3, 100)
ax.plot(x_line, x_line**2, color="#C44E52", lw=2, ls="--", label="y = x²")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title(f"Pearson r = 0.14 (p=0.34) | Distance r = 0.49 (p=0.001)")
ax.legend()
plt.tight_layout()
plt.savefig("cor2_nonlinear.png", dpi=150)
Unique Contribution — Semi-Partial Correlation¶
In psychometric validation, you often want to know how much unique variance a predictor adds after accounting for other variables. Semi-partial correlation removes covariates from one variable only, preserving the total variance of the other.
df_sp = pl.DataFrame({
"test_score": [85.0, 90.0, 78.0, 92.0, 88.0, 76.0, 95.0, 83.0, 89.0, 91.0,
80.0, 87.0, 93.0, 84.0, 86.0, 79.0, 94.0, 82.0, 88.0, 90.0],
"study_hours": [6.0, 8.0, 4.0, 9.0, 7.0, 3.0, 10.0, 5.0, 7.0, 8.0,
5.0, 6.0, 9.0, 5.0, 6.0, 4.0, 10.0, 5.0, 7.0, 8.0],
"sleep_hours": [7.0, 8.0, 6.0, 8.0, 7.0, 5.0, 9.0, 6.0, 7.0, 8.0,
6.0, 7.0, 8.0, 7.0, 7.0, 5.0, 9.0, 6.0, 7.0, 8.0],
"iq": [110.0, 120.0, 100.0, 125.0, 115.0, 95.0, 130.0, 105.0, 118.0, 122.0,
102.0, 112.0, 128.0, 108.0, 111.0, 98.0, 132.0, 103.0, 116.0, 120.0],
})
Semi-Partial Correlation¶
What unique variance does study_hours add to test_score prediction, beyond what sleep and IQ already explain?
result = df_sp.select(
ps.semi_partial_cor("test_score", "study_hours", ["sleep_hours", "iq"]).alias("sp")
)
sp = result["sp"][0]
print(f"Semi-partial r: {sp['estimate']:.4f}, p={sp['p_value']:.3f}")
Expected output:
Partial Correlation¶
For comparison, the partial correlation (which removes covariates from both variables):
result = df_sp.select(
ps.partial_cor("test_score", "study_hours", ["sleep_hours", "iq"]).alias("pc")
)
pc = result["pc"][0]
print(f"Partial r: {pc['estimate']:.4f}, p={pc['p_value']:.3f}")
Expected output:
Both correlations are effectively zero. After controlling for sleep and IQ, study hours adds virtually no unique variance to test scores. The strong bivariate correlation between study_hours and test_score (which you would see with ps.pearson) is explained entirely by their shared relationship with IQ — students with higher IQ both study more and score higher.
Normality Testing — D'Agostino¶
The D'Agostino K-squared test combines skewness and kurtosis tests into a single omnibus test for normality. It is particularly well-suited for larger samples.
rng = np.random.default_rng(42)
df_dag = pl.DataFrame({
"normal_data": rng.normal(100, 15, 100).tolist(),
"skewed_data": rng.exponential(10, 100).tolist(),
})
Normal Distribution¶
result = df_dag.select(
ps.dagostino("normal_data").alias("dag")
)
dag = result["dag"][0]
print(f"Normal data: K²={dag['statistic']:.4f}, p={dag['p_value']:.4f}")
Expected output:
The high p-value means we cannot reject normality — the data is consistent with a normal distribution.
Skewed Distribution¶
result = df_dag.select(
ps.dagostino("skewed_data").alias("dag")
)
dag = result["dag"][0]
print(f"Skewed data: K²={dag['statistic']:.4f}, p={dag['p_value']:.6f}")
Expected output:
The exponential data strongly violates normality (significant skewness and kurtosis). This confirms what we would expect from a right-skewed distribution.
When to use D'Agostino vs Shapiro-Wilk
Shapiro-Wilk is more powerful for small samples (n < 50). D'Agostino is better suited for larger samples (n > 50) and specifically tests for skewness and kurtosis. For a comprehensive check, run both:
df_dag.select(
ps.shapiro_wilk("normal_data").alias("shapiro"),
ps.dagostino("normal_data").alias("dagostino"),
)
See Hypothesis Testing for the Shapiro-Wilk test.