October is the Nobel Prize month – and most years, we really don’t pay all that much attention (except perhaps to the Peace Prize, where everyone seems to have an opinion). But every once in a while, the Nobel Committee lands on a body of work whose effects you can directly see around you. The 2021 Economics prize is one such case – a big part of the empirical work in economics that we read, and, more importantly, that guides public policy, is a direct consequence of the work done by this year’s winners. As the citation says, the award went to David Card of the University of California, Berkeley, “for his empirical contributions to labour economics”, and to Joshua Angrist of MIT and Guido Imbens of Stanford University “for their methodological contributions to the analysis of causal relationships”. Their work on ‘natural experiments’ is, once you sink your teeth into it, devilishly clever and fundamental. And it, in turn, helped spawn a mini-revolution of sorts in ‘causal inference’.
As human beings, we have always reasoned about cause and effect, and even more critical is our ability to understand the difference between association and causation. Philosophers across cultures have thought long and hard about causality at various levels of abstraction. That is a fascinating topic in itself – but a topic for another day.
This year’s Nobel Prize winners’ contribution is more relevant to a typical (and extremely knotty) real-world problem: how do you attribute an outcome to an action? Does the statement “X caused Y” also mean that Y is present because of X AND that Y would not have been present if X were not present? This is by no means merely a semantic question – it can have profound real-life implications. In fact, businesses across industries and contexts try to answer this very question all the time: from a marketing manager trying to discern the effect of a promotion, to far more consequential situations like a pharma firm trying to prove that a drug cures a disease. It is worthwhile to repeat the obvious methodologies that we all have used:
- Learn from historical data: Regression (and any supervised learning method) is all about looking at historical occurrences and estimating the relationships between actions (independent variables) and outcomes (the dependent variable) – relationships we then often read as causal. And as anyone who has developed statistical models knows, this is not exactly the most reliable route. Even more so in the current world of exploding complexity, which keeps reducing the signal-to-noise ratio, making it ever harder to draw meaningful causal conclusions from historical data.
- Learn from controlled experiments: When you don’t have the luxury of learning from the past (e.g. a new product launch, drug trials), or it is just not reliable to learn from historical data (e.g. changing conditions), it is best to design and execute experiments, use them to draw associations, and – if we get really good and/or lucky – establish a causal relationship. Billions of dollars ride on the design and execution of controlled trials to prove the causal relationship between a drug and the cure (ask any pharma manufacturer).
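The first bullet above can be made concrete with a minimal sketch – a one-variable least-squares fit on historical data. The numbers here are entirely illustrative (a hypothetical promotion-spend vs. sales-lift series), and the point is the caveat in the final comment, not the arithmetic:

```python
# Minimal sketch: learning a relationship from historical data with
# ordinary least squares (one driver). All numbers are made up.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. promotion spend (hypothetical)
ys = [2.1, 3.9, 6.2, 8.1, 9.9]   # e.g. observed sales lift (hypothetical)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Closed-form OLS slope and intercept for a single regressor.
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# The fitted slope measures association in the historical data;
# reading it as a causal effect is an extra (and often shaky) leap.
print(round(slope, 2), round(intercept, 2))
```

In practice one would reach for a library (e.g. statsmodels or scikit-learn), but the caveat is the same: the coefficient summarizes the past, and only a causal design can justify reading it as “what would happen if we changed X”.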
But here’s the rub – what if you are unable to do either of the two? We are going through this very situation: Covid-19 continues to impact just about every facet of our lives, and almost every business is scrambling to answer the question: what impact will Covid-19 have on their business? For instance, every consumer bank is trying to figure out how loan default rates will change with all the economic uncertainty triggered by Covid-19. The fundamental problem in answering this question is obvious – there is no way for the banks to know how loan default rates would have changed if there were no Covid-19. Or, in the jargon, there is no way to reliably measure the impact since we don’t have the right counterfactuals. To further complicate things, the observed association between Covid-19 and loan default rates faces the following challenges:
- Estimating the actual causal effect of Covid-19-related unemployment on loan default rates in the presence of multiple interacting variables (e.g. Covid-19 may have also caused supply chain disruptions, which in turn led to working capital issues, leading to loan defaults, and so on)
- There could be a third factor (called a confounder) which impacts both the driver (unemployment) and the outcome (loan default). For instance, PPP loans propped up some businesses and helped stem unemployment, but at the same time increased the loan liabilities of small businesses, leading to greater defaults (as seems to be playing out).
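The confounding problem in the second bullet can be seen in a toy calculation. The numbers below are entirely made up for illustration: a binary confounder z (think relief-loan uptake), a binary driver x (think a local unemployment shock), and default counts per group. Comparing default rates while ignoring z makes the driver look far more damaging than it is, because z is correlated with both:

```python
# Toy illustration (all numbers invented) of how a confounder distorts
# the naive association between a driver and an outcome.
# Each tuple: (z, x, n, defaults) — z: confounder level, x: driver level,
# n: number of firms in the group, defaults: how many defaulted.
data = [
    (1, 1, 80, 24), (1, 0, 25, 7),   # z = 1: high uptake, high default rates
    (0, 1, 20, 2),  (0, 0, 75, 6),   # z = 0: low uptake, low default rates
]

def rate(rows):
    """Pooled default rate over a list of (z, x, n, defaults) rows."""
    return sum(r[3] for r in rows) / sum(r[2] for r in rows)

# Naive comparison: difference in default rates by x, ignoring z entirely.
naive = rate([r for r in data if r[1] == 1]) - rate([r for r in data if r[1] == 0])

# Adjusted comparison: difference within each level of z, then averaged.
within = [
    rate([r for r in data if r[0] == z and r[1] == 1])
    - rate([r for r in data if r[0] == z and r[1] == 0])
    for z in (0, 1)
]
adjusted = sum(within) / len(within)

print(round(naive, 3), round(adjusted, 3))
```

With these invented numbers the naive gap is 13 percentage points while the within-stratum gap is only 2 – the rest is the confounder. The catch, of course, is that you can only adjust for confounders you have observed and measured.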
If all this starts to give you a headache, you are not alone. Anyone who has spent enough time working with data has had to handle these types of scenarios – and as any data scientist worth her salt will tell you, they are only becoming more common in the ‘new normal’, as businesses continue to deal with ever-increasing complexity.
This is where an important element of this year’s Nobel Prize winners’ work comes in: the idea of natural experiments with plausible treatment and control groups – analysed with what has become known as the ‘difference in differences’ approach [see picture].
Their work (the original papers are worth reading) has spawned a slew of empirical economics studies – the result of economists opening their eyes to the natural experiments all around them. It is about time that data teams in organizations take inspiration from this concept as well. A word of caution: there is an important, potentially fatal flaw in this method – it is impossible to prove the parallel-trends assumption (the counterfactual problem again!). But that is a topic for another day.
- Difference in Differences: http://www.publichealth.columbia.edu/research/population-health-methods/difference-difference-estimation
- The Credibility Revolution in Economics: https://economics.mit.edu/files/5566