Causality

Simpson's Paradox: Shadow versus slices of an ellipsoid

Simpson's Paradox

The simple regression of 'Heart' on 'Coffee' appears as a blue plane, and the multiple regression of 'Heart' on 'Coffee' and 'Stress' eventually appears as a red plane. The ellipsoid is the 'data ellipsoid': its one-dimensional projection onto any line is the interval from the mean minus one standard deviation to the mean plus one standard deviation of the data projected onto that line. It is the unit sphere for Mahalanobis distance.

Simpson's Paradox applied to causality involves three variables: a response Y, a potential cause X and a conditioning variable Z. Each variable can be either a continuous numeric variable or a categorical variable. The three variables therefore admit 2 x 2 x 2 = 8 combinations of types, each with its own graphical visualization of the paradox.

In this artificial example, we have three continuous variables. We suppose that Coffee (X) is a mild palliative for the harmful effect of Stress (Z) on Heart Damage (Y). However, if we simply regress Y on X without controlling for Z, Coffee appears very harmful. There is a strong positive (i.e. harmful) relationship between Y and X because both are strongly related to the confounding variable Z. In fact, there is a strong positive relationship between each pair of the three variables. However, when controlling for Z, the conditional relationship between Y and X becomes negative.

If we merely observe the unconditional relationship between X and Y, we could conclude that Coffee is harmful. However, the conditional relationship, provided that Z is indeed a confounding factor and not a mediator, reveals that Coffee could be mildly beneficial.
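The reversal described above can be reproduced with a small simulation. This is only a sketch: the data-generating coefficients (0.9, -0.3, and the noise scales), the sample size, and the helper name `ols` are illustrative assumptions, not the values behind the article's example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (assumed for illustration only):
# Stress raises both Coffee consumption and Heart damage,
# while Coffee itself has a mild protective (negative) effect.
stress = rng.normal(size=n)                                      # Z
coffee = 0.9 * stress + 0.3 * rng.normal(size=n)                 # X
heart = 1.0 * stress - 0.3 * coffee + 0.3 * rng.normal(size=n)   # Y


def ols(y, *predictors):
    """Least-squares coefficients [intercept, slope_1, slope_2, ...]."""
    design = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Simple regression of Y on X (ignores Z).
marginal_slope = ols(heart, coffee)[1]

# Multiple regression of Y on X and Z (controls for Z).
conditional_slope = ols(heart, coffee, stress)[1]

# Pairwise correlations among the three variables.
corr = np.corrcoef(np.vstack([coffee, heart, stress]))

print("marginal slope of Heart on Coffee:   ", marginal_slope)
print("conditional slope of Heart on Coffee:", conditional_slope)
print("pairwise correlations:\n", corr)
```

Under these assumed coefficients every pairwise correlation is positive, the marginal slope is positive, and the slope on Coffee after controlling for Stress is negative.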

The geometry of the data ellipsoid plays an interesting role. The 'unit data ellipsoid' consists of the points whose Mahalanobis distance from the mean of the 3-dimensional data cloud is 1; it is the unit Mahalanobis sphere. It captures (and is equivalent to) all first and second moments of the data cloud. Any statistical procedure, such as least-squares regression, that depends only on the first and second moments of the data cloud can be expressed as a geometric function of the data ellipsoid.
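As a rough numerical check of these properties, the sketch below builds points on the unit data ellipsoid of a hypothetical 3-dimensional cloud, confirms that their Mahalanobis distance from the mean is 1, and compares the ellipsoid's shadow on an arbitrary line with the mean plus one standard deviation along that line. The cloud, the direction `a`, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated 3-D data cloud (assumed for illustration).
mixing = rng.normal(size=(3, 3))
data = rng.normal(size=(2000, 3)) @ mixing.T + np.array([1.0, -2.0, 0.5])

mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)  # sample covariance, divisor n - 1

# Map points u on the unit sphere to the ellipsoid: mean + L u,
# where L is the Cholesky factor of the covariance (cov = L L^T).
chol = np.linalg.cholesky(cov)
u = rng.normal(size=(20000, 3))
u /= np.linalg.norm(u, axis=1, keepdims=True)
surface = mean + u @ chol.T

# Squared Mahalanobis distance of every surface point from the mean.
diff = surface - mean
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# Shadow of the ellipsoid on the line spanned by a unit vector a.
a = np.array([1.0, 2.0, -1.0])
a /= np.linalg.norm(a)
shadow = surface @ a

# Standard deviation of the data projected onto the same line.
sd_along_a = np.sqrt(a @ cov @ a)

print("max squared Mahalanobis distance:", d2.max())
print("upper end of shadow:             ", shadow.max())
print("mean + 1 sd along a:             ", mean @ a + sd_along_a)
```

Because the surface is sampled at random, the upper end of the shadow only approximates the mean plus one standard deviation from below; the exact tangent point is `mean + cov @ a / sd_along_a`.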

The simple regression of Y on X is determined by the data ellipse that is the frontal projection of the data ellipsoid on the X-Y plane. The multiple regression of Y on X and Z has conditional regression lines, given Z, that are determined by the frontal sections of the data ellipsoid. Thus the marginal relationship between Y and X corresponds to the projection of the ellipsoid (shown by a blue ellipse) while the conditional relationship corresponds to the sections of the ellipsoid (shown by red ellipses).

The principal frontal section ellipse (at the centre of the three-dimensional ellipsoid) is the data ellipse of the added-variable plot, also known as the partial regression plot, for the regression of Y on X controlling for the linear effect of Z.
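The projection and the central section can both be read off the covariance matrix, which gives a quick way to check the statements above numerically. In this sketch the simulated data, the variable order (X, Y, Z), and the helper `residuals` are assumptions for illustration. The shadow on the X-Y plane is taken to be the 2 x 2 marginal covariance block, and the central section is taken to be the inverse of the corresponding block of the inverse covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Hypothetical data (assumed coefficients, for illustration only).
z = rng.normal(size=n)                                # Stress
x = 0.9 * z + 0.3 * rng.normal(size=n)                # Coffee
y = 1.0 * z - 0.3 * x + 0.3 * rng.normal(size=n)      # Heart

# Covariance of (X, Y, Z); its unit ellipsoid is the data ellipsoid.
S = np.cov(np.column_stack([x, y, z]), rowvar=False)

# Projection (shadow) on the X-Y plane: marginal covariance of (X, Y).
shadow = S[:2, :2]

# Central section at Z = mean: inverse of the (X, Y) block of S^{-1}.
section = np.linalg.inv(np.linalg.inv(S)[:2, :2])


def residuals(v, z):
    """Residuals of v after a least-squares fit on an intercept and z."""
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef


# Added-variable plot: residuals of Y on Z against residuals of X on Z.
rx, ry = residuals(x, z), residuals(y, z)
avp_cov = np.cov(np.column_stack([rx, ry]), rowvar=False)

# Regression slopes implied by each ellipse (covariance / variance of X).
marginal_slope = shadow[0, 1] / shadow[0, 0]
conditional_slope = section[0, 1] / section[0, 0]

print("shadow slope (marginal):     ", marginal_slope)
print("section slope (conditional): ", conditional_slope)
print("section matches AVP ellipse: ", np.allclose(section, avp_cov))
```

The same covariance matrix thus yields a positively tilted shadow and a negatively tilted section for this assumed data set.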

The geometric representation of Simpson's Paradox is that the projection of the ellipsoid and its sections can have opposite orientations.
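The condition for opposite orientations can be written in terms of the second moments. The notation below is an assumption introduced here: s with two subscripts denotes a sample variance or covariance, and r denotes a correlation. The slope of the projected ellipse (simple regression) and of the section ellipses (multiple regression) are, respectively:

```latex
\begin{aligned}
b_{YX} &= \frac{s_{XY}}{s_{XX}}, \\
b_{YX \cdot Z} &= \frac{s_{XY}\,s_{ZZ} - s_{XZ}\,s_{YZ}}{s_{XX}\,s_{ZZ} - s_{XZ}^{2}}.
\end{aligned}
```

The denominator of the second expression is positive whenever X and Z are not perfectly correlated, so with a positive marginal slope the conditional slope is negative exactly when s_XY s_ZZ < s_XZ s_YZ, or equivalently r_XY < r_XZ r_YZ. This is the situation in the Coffee example: X and Y are positively correlated, but less strongly than their shared dependence on Z would imply.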