Recently I have been thinking a lot about causality, largely triggered by reading The Book of Why and Fooled By Randomness (one of my all-time favourites) in quick succession. The central theme of both books is causation. In the former, Judea Pearl discusses a new way to quantify the relationship between variables that places the emphasis on causal links rather than purely associative ones. In the latter, Nassim Taleb explains, among other things, how many relationships that look causal may not be. We are easily fooled by randomness.
These books have been accompanied by a collection of other bits and pieces randomly encountered on the internet. Naval’s latest podcast on science and epistemology focuses on how science is really about explanations – explaining why things happen. This question is all about causation.
Is statistics science? It’s not very clear. On the one hand, statisticians do make testable predictions. It’s one of the main uses of traditional statistics, in fact. But these predictions are based on associations, not explanations. The statistician is not really concerned with why smoking causes lung cancer, only that there is statistically significant evidence that the two go together.
Statistics is more of a starting point. It indicates that there might be a causal link between two variables, something that proper scientists may want to investigate. This is why statistics is powerful: it can point to possible causal links that might not otherwise have been considered.
This can be useful in highly complex environments where causation is unclear. It’s easy to understand why not wearing a coat will make you feel cold in winter; we don’t need statistics in this instance. But sometimes it’s not that easy. Think of population numbers in a complex ecosystem – statistics can tell us that, historically, as the numbers of one species rise, those of another fall. This may have been unclear (and maybe unknowable) a priori using deductive techniques.
When we don’t know why
What factors influence the difficulty in determining causality?
Firstly, the time scale over which one is examining the relationships between the variables. Some may think that the short term is easier to handle. There are fewer events and fewer interactions between nodes to incorporate into the analysis. It is easier to explain why a particular shot went in than it is to explain why your Sunday league striker scored 14 goals this season. But this might not be true in all circumstances. Think of financial markets – it’s probably more obvious why the share price of Facebook has risen over the last 10 years than why it went up yesterday or last week.
Next up is scale. What we really mean here is the number of interacting nodes (and the number of interactions). Is it the case that scale leads to complexity? Is it easier to predict the weight of your mate Ben, whose waistline has been slowly but steadily expanding over the last 5 years, than the average weight of the population of the UK over the next year? Surely the latter is just the analysis applied to the former, repeated for every person and added up? What about a domain with many interactions between members of the network, like financial markets? Is it harder to predict the price of BT or of a global ETF? It’s not immediately obvious. The price of BT surely has fewer determinants but is probably more sensitive to random events. If the global population continues to grow, productivity continues to improve, GDP continues to grow, etc., it’s hard not to see a global ETF increasing in price. The same cannot be said of BT.
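One way to make the Ben-versus-the-UK question concrete is a toy simulation. This is only a sketch under assumptions of my own: each person's yearly weight change is a shock shared by everyone plus a personal shock, both normally distributed, and the function name, parameters and numbers are all made up for illustration. It compares how much one person's change varies with how much the average change varies.

```python
import random
from statistics import pstdev


def forecast_spread(n_people, shared_sd, personal_sd, trials=2000, seed=0):
    """Return (spread of one person's change, spread of the average change).

    Hypothetical model: change = shared shock + personal shock.
    The 'spread' is the standard deviation across simulated years.
    """
    rng = random.Random(seed)
    one_person, average = [], []
    for _ in range(trials):
        shared = rng.gauss(0, shared_sd)  # hits everyone alike
        changes = [shared + rng.gauss(0, personal_sd) for _ in range(n_people)]
        one_person.append(changes[0])          # "Ben"
        average.append(sum(changes) / n_people)  # "the population"
    return pstdev(one_person), pstdev(average)


if __name__ == "__main__":
    # No shared shock: personal noise largely cancels in the average.
    print(forecast_spread(n_people=100, shared_sd=0.0, personal_sd=1.0))
    # Strong shared shock: averaging removes far less of the uncertainty.
    print(forecast_spread(n_people=100, shared_sd=1.0, personal_sd=1.0))
```

When the shocks are independent, the average is far more predictable than any one individual. Once everyone is exposed to the same shock, which is roughly what "many interactions" means in a market, averaging no longer helps much. That is one reason the answer to the scale question is not obvious.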
In complex environments it may not be possible to construct an accurate map of most of the causal relationships. We are never going to be able to understand how the FTSE 100 price is determined. But can we know enough to be useful? Can we at least identify the major causal factors? Maybe. If not, we will have to rely on statistics, which can get very messy very quickly and, as always, we are likely to be fooled by the conclusions extracted from data-mining-type techniques conducted in these highly complex environments.
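To get a feel for how easily data mining can fool us, here is a rough Python sketch. It generates one random "target" series and then searches a pile of equally random, completely unrelated series for the one that correlates with it best. Everything here is an assumption for illustration: the series are simulated random walks rather than real prices, and the function names, the 200 candidates, the 250 steps and the seed are arbitrary choices of mine.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, steps):
    """A series whose every move is pure noise."""
    level = 0.0
    path = []
    for _ in range(steps):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def best_spurious_match(n_candidates=200, steps=250, seed=0):
    """Strongest correlation found between a target and unrelated series."""
    rng = random.Random(seed)
    target = random_walk(rng, steps)
    candidates = [random_walk(rng, steps) for _ in range(n_candidates)]
    correlations = [pearson(target, c) for c in candidates]
    return max(correlations, key=abs)


if __name__ == "__main__":
    r = best_spurious_match()
    print(f"strongest correlation among unrelated series: {r:+.2f}")
```

By construction, none of the candidates has anything to do with the target, yet the best match usually looks impressive. Search enough variables and something will always fit, which is exactly the trap in these highly complex environments.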
The alternative? Simply not engaging in any attempt at inference of this form in these domains. Until a robust framework of causation becomes mainstream and widely accepted, maybe the best way to avoid getting burnt is to simply pick up your ball and go home.