One quick disclaimer before I proceed. When I have quoted one or more Wikipedia articles in the text, it is because I have found them well-written, informative, and adequately illustrative; however, I shall make no claim as to their veracity and/or authenticity because I have not been able to access and verify all the background references therein. If you find an error, please feel free to chide me in the comments.
An important maxim in science, or more precisely, in the scientific study of relationships among variables, is that ‘Correlation does not imply Causation’. Indeed, unless causality has been verifiably established through independent means, any claim that a correlation demonstrates causation commits the logical fallacy of questionable cause, cum hoc, ergo propter hoc (Latin for “with this, therefore because of this”).
It is important for all to understand this concept – those who are engaged in scientific studies, as well as those who read about and interpret such studies.
Correlation is a statistical relationship between two or more random variables; for simplicity’s sake, let’s consider two, say, A and B, such that if changes in the values of variable A statistically correspond to changes in the values of variable B, a correlation is said to exist between A and B. This reflects a statistical dependence between A and B, and therefore, statistically-computed correlations can be used in a predictive manner. To pick a completely random example, the epidermal growth factor receptor (EGFR) is expressed on neoplastic cells in colorectal carcinoma. The number of cells expressing EGFR was found to be correlated with the size of the tumor (adenoma), i.e., cells from a larger tumor express more EGFR. Therefore, EGFR expression may be useful as a prognostic biomarker for adenoma progression.
Those who have already identified the problem in this assertion, congratulations! As the paper cautions, although the EGFR pathway is important to colorectal carcinogenesis, it is unknown at this point whether the observed increase in EGFR expression arises because each neoplastic cell makes more EGFR for some reason, or because a larger tumor simply houses more of the cells that are capable of making EGFR. This, as you can understand, is an important distinction, and therefore, the authors conclude correctly that “Further larger studies are needed to explore EGFR expression as a biomarker for adenoma progression.”
Such examples abound, all illustrating how correlations can be useful in suggesting possible causal or mechanistic relationships between variables, but more importantly, such statistical interdependence between the said variables is not sufficient for logical implication of a causal relationship. In other words, while empirically A may be observed to vary in conjunction with B, that observation is not enough to assume A causes B.
But what happens when one makes such an erroneous assumption? For starters, one is then disregarding the alternatives. There are five possibilities in all, any of which may be true and account for the correlation, and assuming the first means ignoring the other four.
- A may cause B.
- B may cause A.
- An unknown or uncharacterized third variable C may cause both A and B.
- A and B may influence each other, with or without C, in a self-reinforcing feedback loop.
- A and B may change at the same time without any direct logical or actual relationship to each other, beyond the fact that the changes occur together, a situation also known as coincidence. A coincidence may allude to multiple, complex or indirect factors that are unknown or too nebulous to ascribe causality to, or may reflect pure, random chance.
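The third possibility in the list above, a hidden variable C driving both A and B, is easy to see in a small simulation. What follows is only a minimal sketch under my own assumptions: the data are synthetic, the names `pearson` and `residuals` are helpers I made up for illustration, and Pearson's coefficient is just one common way of measuring correlation. Here A and B never influence each other, yet they come out clearly correlated; once the linear effect of C is removed from both, the correlation all but disappears.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical setup: C causes both A and B; A and B never touch each other.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

# Raw correlation between A and B (about 0.5 by construction).
r_ab = pearson(a, b)

# Correlation after removing C's linear effect from both (close to zero).
r_ab_given_c = pearson(residuals(a, c), residuals(b, c))

print(f"corr(A, B)             = {r_ab:.3f}")
print(f"corr(A, B | C removed) = {r_ab_given_c:.3f}")
```

Of course, this trick only works when C has been identified and measured; an unknown or uncharacterized C leaves the raw correlation looking just as convincing as a causal one.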
Each of these five hypotheses is testable and there are statistical methods available to reduce the occurrence of coincidences. Therefore, the mere observation that A and B are statistically correlated doesn’t lend itself to any definitive conclusion as to the existence and/or directionality of a causal relationship between them.
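As one illustration of a statistical guard against coincidence, here is a rough sketch of a permutation test. It is my own choice of method, not one prescribed anywhere in this post, and the data and the function name `permutation_p_value` are hypothetical. The idea is to shuffle one variable many times, which destroys any real pairing, and count how often chance alone produces a correlation at least as strong as the observed one.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=1000, seed=0):
    """Estimate how often shuffled data correlate as strongly as the real data."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(50)]
related = [xi + rng.gauss(0, 1) for xi in x]     # built from x
unrelated = [rng.gauss(0, 1) for _ in range(50)]  # independent of x

p_related = permutation_p_value(x, related)
p_unrelated = permutation_p_value(x, unrelated)

print(f"p-value, related pair:   {p_related:.4f}")
print(f"p-value, unrelated pair: {p_unrelated:.4f}")
```

A small p-value only says that the correlation is unlikely to be a fluke. It says nothing about which of the remaining hypotheses is responsible for it.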
Determination of causality is an entirely different ball of wax, and that discussion is beyond the scope of this post. Suffice it to say that in the sciences, causality is not assumed or given. The scientific method requires that the scientists set up empirical experiments to determine causality in a relationship under investigation.
The scientific method works in logical progression.
- Initial observations (of a putative relationship between variables) are made.
- An explanation is proposed in the form of one or several hypotheses about possible causal relationships, including one of no relationship (the Null hypothesis).
- Certain predictions or models may be generated on the basis of each of the hypotheses, which in turn guide the experimental design.
- Experiments are designed to demonstrate the falsifiability of the hypotheses, i.e., to test the logical possibility that the hypotheses could be proven false by a particular empirical observation. Indeed, testing for falsifiability or refutability is a key part of the scientific process.
- Once designed, the experiments are used to test the hypotheses rigorously, and the data are analyzed critically to reach a conclusion, accepting or rejecting the hypotheses.
- But the method doesn’t cease there. All empirical observations are potentially under continued scrutiny, which involves reconsideration of the derived results, as well as re-examination of the methodology, especially in the light of newer techniques that are capable of taking deeper and more accurate measurements. Such is the dynamic nature of the scientific method.
Establishment of causality, therefore, has to pass through the same rigorous filters before it can be accepted. But if it does, the conclusions may be considered unimpeachably valid, within the given set of circumstances.
So… Correlation doesn’t inherently imply causation.
Some modern examples are in Part Deux. Please don’t hesitate to comment.