5. “Correlation is Not Causation”#

This article looks at one of the most common sentences in statistics, “Correlation is not causation”, and uses it as a starting point to discuss the definitions of correlation, independence, and causation.

5.1. Definitions#

5.1.1. Correlation#

Correlation, in its broadest sense, is any statistical relationship between two random variables, whether causal or not.

The most common measure of correlation is the Pearson correlation coefficient, which quantifies linear dependence. The Pearson correlation between two random variables \(X\), \(Y\) is defined as the ratio

\[\rho_{XY} := \dfrac{\text{cov}(X,Y)}{\sigma_X \sigma_Y} \ ,\]

of the covariance \(\text{cov}(X,Y) = \mathbb{E}\left[ (X-\mu_X) (Y-\mu_Y) \right]\) to the product of the standard deviations, where \(\mu_Z = \mathbb{E}\left[ Z \right]\) is the expected value of a generic variable \(Z\) and \(\sigma_{Z} = \sqrt{\mathbb{E}\left[ \left( Z - \mu_Z \right)^2 \right]}\) is its standard deviation.
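As a minimal sketch, the definition can be applied to a finite sample by replacing expected values with sample means. The data below are synthetic, and the function name `pearson_correlation` is an illustrative choice, not part of any library. The result is compared with NumPy's `np.corrcoef`.

```python
import numpy as np

def pearson_correlation(x, y):
    """Sample Pearson correlation computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centre both samples on their means.
    dx = x - x.mean()
    dy = y - y.mean()
    # The same 1/N normalisation is used for the covariance and for both
    # standard deviations, so it cancels in the ratio.
    cov_xy = np.mean(dx * dy)
    sigma_x = np.sqrt(np.mean(dx ** 2))
    sigma_y = np.sqrt(np.mean(dy ** 2))
    return cov_xy / (sigma_x * sigma_y)

# Synthetic data (an assumption for illustration): y depends linearly on x.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

rho_manual = pearson_correlation(x, y)
rho_numpy = np.corrcoef(x, y)[0, 1]
print(f"from the definition: {rho_manual:.4f}")
print(f"np.corrcoef:         {rho_numpy:.4f}")
```

The two values should agree up to floating-point rounding, since the normalisation factor cancels in the ratio.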

5.1.2. Statistical independence#

Two random variables \(X\), \(Y\) are statistically independent if the conditional probability \(p(X|Y)\) is equal to the unconditional probability \(p(X)\),

\[p(X|Y) = p(X)\]

Thus, the joint probability reads

\[p(X,Y) = p(X|Y) p(Y) = p(X) p(Y) \ , \]

i.e. the joint probability of independent random variables is the product of their unconditional (marginal) probabilities.

As \(p(X,Y) = p(Y|X) p(X)\), it also follows that \(p(Y|X) = p(Y)\).
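For discrete variables, these relations can be checked directly on a probability table. The sketch below uses a hypothetical joint distribution, built to be independent by construction, and verifies that the conditional probability equals the marginal and that the joint factorizes.

```python
import numpy as np

# Hypothetical marginals (assumed values, for illustration only).
p_x = np.array([0.2, 0.8])
p_y = np.array([0.5, 0.3, 0.2])

# Joint table p(X, Y): rows index X, columns index Y.
# An outer product makes X and Y independent by construction.
joint = np.outer(p_x, p_y)

# Marginals recovered from the joint table.
marg_x = joint.sum(axis=1)
marg_y = joint.sum(axis=0)

# Conditional p(X | Y = y): each column of the joint divided by p(Y = y).
cond_x_given_y = joint / marg_y

# Independence: the joint equals the product of the marginals.
independent = np.allclose(joint, np.outer(marg_x, marg_y))

print("p(X | Y), one column per value of Y:")
print(cond_x_given_y)
print("joint factorizes:", independent)
```

Every column of `cond_x_given_y` reproduces `p_x`, which is the statement \(p(X|Y) = p(X)\) for this table.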

5.1.2.1. Statistical independence implies no correlation#

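Statistical independence of variables \(X\), \(Y\) implies \(p(X,Y) = p(X) p(Y)\), so the expected value of the product of a function of \(X\) and a function of \(Y\) factorizes into the product of the expected values. Since \(\mathbb{E}\left[ X - \mu_X \right] = \mu_X - \mu_X = 0\), direct computation of the covariance \(\text{cov}(X,Y)\) gives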

\[\begin{split}\begin{aligned} \text{cov}(X,Y) & = \mathbb{E}\left[ \left( X - \mu_X \right) \left( Y - \mu_Y \right) \right] = \\ & = \mathbb{E}\left[ X - \mu_X \right] \mathbb{E} \left[ Y - \mu_Y \right] = 0 \ . \end{aligned}\end{split}\]
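The converse does not hold in general: zero covariance does not imply independence. A small exact check, assuming a hypothetical variable \(X\) uniform on \(\{-1, 0, 1\}\) and \(Y = X^2\), shows a pair that is fully dependent yet uncorrelated. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Assumed toy model: X is uniform on {-1, 0, 1} and Y = X**2,
# so Y is completely determined by X.
support = [-1, 0, 1]
p = Fraction(1, 3)
joint = {(x, x * x): p for x in support}

def expectation(f):
    """Expected value of f(X, Y) under the joint distribution."""
    return sum(prob * f(x, y) for (x, y), prob in joint.items())

mu_x = expectation(lambda x, y: x)
mu_y = expectation(lambda x, y: y)
cov_xy = expectation(lambda x, y: (x - mu_x) * (y - mu_y))

# Compare p(X=0, Y=0) with p(X=0) * p(Y=0).
p_x0 = sum(prob for (x, y), prob in joint.items() if x == 0)
p_y0 = sum(prob for (x, y), prob in joint.items() if y == 0)
p_x0_y0 = joint.get((0, 0), Fraction(0))

print("cov(X, Y)       =", cov_xy)
print("p(X=0, Y=0)     =", p_x0_y0)
print("p(X=0) p(Y=0)   =", p_x0 * p_y0)
```

The covariance vanishes because the positive and negative values of \(X\) cancel, while the joint probability does not factorize.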

5.1.2.2. Correlation of samples drawn from independent random variables#

The sample covariance \(\hat{S}_N\) of \(N\) samples \(\{ (X_n, Y_n) \}_{n=1:N}\),

\[\hat{S}_N := \dfrac{1}{N-1} \sum_{n=1}^{N} \left(X_n - \overline{X}_N \right) \left(Y_n - \overline{Y}_N \right) \ ,\]

drawn from independent random variables \(X\), \(Y\), is itself a random variable with zero expected value, but its realizations are non-zero in general.

In other words, samples of independent (and thus uncorrelated) variables have, in general, a non-zero sample covariance and hence a non-zero sample correlation.
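A quick simulation makes this concrete. The sketch below assumes standard normal variables and arbitrary choices of sample size and number of repetitions. It computes the sample covariance for many independent experiments and reports its mean and spread.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 50     # N, samples per experiment (assumed value)
n_trials = 2000    # number of repeated experiments (assumed value)

# Each row is one experiment with N independent pairs (X_n, Y_n).
x = rng.normal(size=(n_trials, n_samples))
y = rng.normal(size=(n_trials, n_samples))

# Sample covariance with the 1/(N-1) normalisation, one value per experiment.
dx = x - x.mean(axis=1, keepdims=True)
dy = y - y.mean(axis=1, keepdims=True)
sample_cov = (dx * dy).sum(axis=1) / (n_samples - 1)

print(f"mean over trials:   {sample_cov.mean():+.4f}")
print(f"std over trials:    {sample_cov.std():.4f}")
print(f"first realizations: {np.round(sample_cov[:5], 3)}")
```

The mean over experiments is close to zero, while individual realizations scatter around it with a spread that shrinks as \(N\) grows.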

5.1.3. Causality#

Causality is the relation between two events, in which one (the cause) is - at least partly - responsible for the other event (the effect), and the effect is - at least partly - dependent on the cause.

The principle of causality implies that the cause precedes the effect.

In general, an event may have multiple causes (that lie in its past) or have multiple effects.

5.1.3.1. Necessary, sufficient and contributory causes#

  • \(x\) is necessary for \(y\) if the occurrence of \(y\) implies a prior occurrence of \(x\)

  • \(x\) is sufficient for \(y\) if the occurrence of \(x\) implies the subsequent occurrence of \(y\)

  • \(x\) is contributory for \(y\) if it is one among several co-occurring causes.
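The first two definitions can be read as logical implications and tested against a set of observations. The sketch below uses made-up records and illustrative function names, and assumes that \(x\) precedes \(y\) in every record. A finite set of observations can only be consistent with necessity or sufficiency, or refute it. It cannot prove either.

```python
# Hypothetical observations: each record is (x_occurred, y_occurred),
# with x assumed to precede y in time.
records = [
    (True, True),
    (True, False),
    (False, False),
    (True, True),
]

def is_necessary(records):
    """x is necessary for y: y never occurs without x."""
    return all(x for x, y in records if y)

def is_sufficient(records):
    """x is sufficient for y: whenever x occurs, y follows."""
    return all(y for x, y in records if x)

print("consistent with x necessary for y: ", is_necessary(records))
print("consistent with x sufficient for y:", is_sufficient(records))
```

In these records \(y\) never occurs without \(x\), but \(x\) occurs once without \(y\), so the data are consistent with necessity and refute sufficiency.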

5.2. Pearl’s work, Causal Inference in Statistics: A Primer#

5.2.1. Ladder of causation#

Three levels of causation:

  • Association is defined as the conditional probability,

    \[P(A|B) \ ,\]

    and carries no causal implication: it says nothing about the direction of cause and effect, and both events may be caused by a third event

  • Intervention requires an event to be actively performed, not just observed, in a minimal way that limits intrusiveness and unintended effects on the world. This action is represented mathematically using the do-calculus formalism. In order to quantify the effect of performing action \(B\) on \(A\), the probability

    \[P(A| \text{do}(B)) \ ,\]

    is required, where \(\text{do}(\cdot)\) is the operator representing the intervention

  • Counterfactuals involve considering an alternate version of the cause (a past event) and analysing its effects on the same experimental unit or system of interest. …

    \[P(A| B, C)\]
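The difference between the first two levels can be illustrated with a simulation. The sketch below assumes a hypothetical model with binary events in which a third event \(Z\) influences both \(A\) and \(B\), and \(B\) also influences \(A\). All probabilities are made-up values. Observing \(B\) gives \(P(A|B)\), while forcing \(B\) regardless of \(Z\) gives \(P(A|\text{do}(B))\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def simulate(do_b=None):
    """Sample the assumed model Z -> B, Z -> A, B -> A.

    If do_b is given, B is set by intervention and no longer depends on Z.
    """
    z = rng.random(n) < 0.5
    if do_b is None:
        # Observational regime: Z makes B more likely.
        b = rng.random(n) < np.where(z, 0.8, 0.2)
    else:
        # Interventional regime: B is forced to the chosen value.
        b = np.full(n, bool(do_b))
    # A depends on both Z and B (assumed probabilities).
    a = rng.random(n) < 0.3 + 0.4 * z + 0.2 * b
    return a, b

# Association: restrict observed data to the cases where B occurred.
a_obs, b_obs = simulate()
p_a_given_b = a_obs[b_obs].mean()

# Intervention: force B to occur for every unit.
a_do, _ = simulate(do_b=True)
p_a_do_b = a_do.mean()

print(f"P(A | B)     ~ {p_a_given_b:.3f}")
print(f"P(A | do(B)) ~ {p_a_do_b:.3f}")
```

Under these assumed numbers the exact values are \(P(A|B) = 0.82\) and \(P(A|\text{do}(B)) = 0.70\). The observational value is larger because observing \(B\) also makes \(Z\) more likely, and \(Z\) raises the probability of \(A\) on its own.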

5.2.2. Model#

A causal diagram is a directed graph showing causal relationships: nodes (variables) are connected by arrows representing causal influence.

Elements.

  • Junction patterns:

    • chain, \(A \rightarrow B \rightarrow C\)

    • fork at \(B\), \(A \leftarrow B \rightarrow C\)

    • collider at \(B\), \(A \rightarrow B \leftarrow C\)

  • Node types:

    • mediator

    • confounder: affects multiple outcomes, creating a correlation (positive or negative) among them

    • instrumental variable…
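The three junction patterns behave differently when the middle node \(B\) is controlled for. The sketch below assumes linear relations with unit-variance Gaussian noise, an illustrative choice, and compares the plain correlation between \(A\) and \(C\) with their partial correlation given \(B\). In the chain \(B\) acts as a mediator, and in the fork it acts as a confounder.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

def noise():
    return rng.normal(size=n)

def corr(u, v):
    return np.corrcoef(u, v)[0, 1]

def partial_corr(u, v, w):
    """Correlation between u and v after linearly controlling for w."""
    r_uv, r_uw, r_vw = corr(u, v), corr(u, w), corr(v, w)
    return (r_uv - r_uw * r_vw) / np.sqrt((1 - r_uw**2) * (1 - r_vw**2))

results = {}

# Chain A -> B -> C: B mediates the influence of A on C.
a = noise()
b = a + noise()
c = b + noise()
results["chain"] = (corr(a, c), partial_corr(a, c, b))

# Fork A <- B -> C: B is a common cause of A and C.
b = noise()
a = b + noise()
c = b + noise()
results["fork"] = (corr(a, c), partial_corr(a, c, b))

# Collider A -> B <- C: A and C are independent causes of B.
a = noise()
c = noise()
b = a + c + noise()
results["collider"] = (corr(a, c), partial_corr(a, c, b))

for name, (r, r_given_b) in results.items():
    print(f"{name:9s} corr(A, C) = {r:+.3f}   corr(A, C | B) = {r_given_b:+.3f}")
```

In the chain and the fork, \(A\) and \(C\) are correlated, and controlling for \(B\) removes the correlation. In the collider the opposite happens: \(A\) and \(C\) are uncorrelated, and controlling for \(B\) induces a correlation between them.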

5.2.3. Associations#

5.2.4. Interventions#

5.2.5. Counterfactuals#

5.3. Examples#


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr, chi2_contingency

sns.set(style='whitegrid')
np.random.seed(42)

# Simulate correlated data
x = np.random.normal(0, 1, 100)
y = 2 * x + np.random.normal(0, 1, 100)

df = pd.DataFrame({'x': x, 'y': y})
sns.scatterplot(data=df, x='x', y='y')
plt.title('Scatter Plot of Correlated Variables')
plt.show()

# Pearson correlation coefficient
corr, p_value = pearsonr(df['x'], df['y'])
print(f"Pearson correlation: {corr:.2f}, p-value: {p_value:.3f}")

# Simulate independent variables
a = np.random.normal(0, 1, 100)
b = np.random.normal(0, 1, 100)

df_indep = pd.DataFrame({'a': a, 'b': b})
sns.scatterplot(data=df_indep, x='a', y='b')
plt.title('Scatter Plot of Independent Variables')
plt.show()

# Correlation test
corr, p_value = pearsonr(df_indep['a'], df_indep['b'])
print(f"Pearson correlation: {corr:.2f}, p-value: {p_value:.3f}")

# Simulate a confounding variable
z = np.random.normal(0, 1, 100)
x = 2 * z + np.random.normal(0, 1, 100)
y = -3 * z + np.random.normal(0, 1, 100)

df_spurious = pd.DataFrame({'x': x, 'y': y, 'z': z})
sns.scatterplot(data=df_spurious, x='x', y='y')
plt.title('Spurious Correlation via a Confounding Variable')
plt.show()

corr, _ = pearsonr(df_spurious['x'], df_spurious['y'])
print(f"Correlation between x and y: {corr:.2f} (spurious)")

5.4. Your Turn: Explore Causation#

Try changing the relationships between variables and test for correlation. Does correlation imply causation? Try creating a scenario where there is causation but low correlation.