Talk to your analysts about the biggest challenges in adopting Machine Learning to drive business impact, and data is likely to come up as one of them. While there has been a huge improvement in the quality and volume of data, there is also a growing sense in organizations that a power law of sorts is emerging in the problem space itself. At the risk of over-simplification, let's invoke the 80/20 Pareto principle: metaphorically, 80% of the problems you need to solve are now more or less well understood, and if you have a decent team of analysts and have invested in BI and Data Science tools, solving them will take up about 20% of your analytics team's time. Conversely, a big (and increasing) share of time and effort will need to go into the long tail of problems, because solving those will create disproportionate value for the organization. Needless to say, these are not easy to solve, and among the challenges, data is probably the biggest one. Over the last few weeks, we have looked at some of the strategies for working with small data.
Today, I want to focus on a specific challenge: what do you do when the life-blood of Machine Learning, labeled data for training the models, is not adequate? The traditional method almost all data scientists follow is to generate random samples of data from the underlying distribution and use those to train the models. You could call this method 'passive learning': once you pick a distribution, you stick with it. To be sure, your data scientist should be trying different sampling strategies, but there is no attempt to learn about the underlying distribution itself. Which brings us to Active Learning, and the key idea: can we 'help' the algorithm achieve greater accuracy by improving the training data? Here is a hypothetical example that serves to make the point. The figure is taken from this survey on Active Learning (well worth a look).
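To make the 'passive' baseline concrete before we get to the figure, here is a minimal sketch under stated assumptions: synthetic data stands in for the real population, and the sample size of 100 is arbitrary. One random sample is drawn, the model is trained once, and the distribution is never revisited.

```python
# A minimal sketch of 'passive learning': one random labeled sample, one model,
# and no second look at the underlying distribution. make_classification is a
# stand-in for your real population data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=8, random_state=0)

rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=100, replace=False)   # the one fixed sample
model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
print("accuracy on the unsampled population:",
      model.score(X[unlabeled], y[unlabeled]))
```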

(a) shows the fairly common problem of classification: how do you create a decision boundary that divides the population into 2 clusters (say, 'low-risk' vs. 'high-risk' customers)? Since you don't have the luxury of knowing the labels upfront (i.e. red vs. green), the strategy would be to draw a sample from the population, label it and use that to train the model (in this case, logistic regression). (b) is what you would get with a random sample: it is clear that there is room for improvement, since the model is skewed away from the red points. In the normal course of events, we would take this as a model accuracy constraint and adjust the decision strategy accordingly. On the other hand, if you could figure out a way to get a better data sample, you could improve the model accuracy, as shown in (c).
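To recreate the spirit of the figure (a toy sketch, not the survey's actual code): two Gaussian clusters play the red and green populations, one logistic regression is trained on 30 random points as in (b), and another on 30 points picked near the class overlap as a hand-rolled stand-in for the better sample in (c). Cluster locations and sample sizes are illustrative.

```python
# Panels (a)-(c) in miniature: a random sample vs. a sample chosen near the
# class overlap, each used to fit a logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
red = rng.normal(loc=[-2.0, 0.0], size=(500, 2))     # 'red' cluster
green = rng.normal(loc=[2.0, 0.0], size=(500, 2))    # 'green' cluster
X, y = np.vstack([red, green]), np.repeat([0, 1], 500)

random_idx = rng.choice(len(X), size=30, replace=False)   # panel (b)
overlap_idx = np.argsort(np.abs(X[:, 0]))[:30]            # toward panel (c)

clf_random = LogisticRegression().fit(X[random_idx], y[random_idx])
clf_overlap = LogisticRegression().fit(X[overlap_idx], y[overlap_idx])
print("random sample accuracy: ", clf_random.score(X, y))
print("chosen sample accuracy: ", clf_overlap.score(X, y))
```

Of course, in the real setting you cannot peek at the overlap like this; the loop described next is how you approximate it using the model's own uncertainty.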
The underlying idea is quite intuitive and very compelling. Instead of being constrained by sampling strategies and leaving it to the machines to train models, what if humans could play an active role in guiding the machine towards a better solution? I have always been a strong advocate of solution strategies that combine human intuition with the algorithms (AI = Augmented Intelligence), and this could be an example of that mindset at work. See below for a simplified flow (a code sketch follows the list):
- Start with the standard method of training a model using the training dataset (based on a sampling strategy)
- Use that trained model to score the population. Most standard data science projects end here – the model accuracy is a constraint you live with
- Before you baseline your population score, insert a human into the process, and get them to focus on specific data points where the machine is not very confident (i.e. the confidence level is below a certain threshold)
- The human takes a look at this subset and accepts or changes the machine-generated label, which is then taken as the revised score
- And in the next iteration, this labeled data goes into the mix for training the model.
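Here is one way the five steps could be wired up with scikit-learn, a sketch rather than a definitive implementation: `oracle_label` is a hypothetical callback standing in for the human reviewer, and the threshold, budget and round count are illustrative knobs.

```python
# The loop above in code: train, score the pool, route low-confidence points
# to the human 'oracle', fold the corrected labels back in, and repeat.
# Assumes the seed labels cover both classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle_label,
                         threshold=0.7, budget=20, rounds=5):
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_labeled, y_labeled)                        # 1. train
        confidence = model.predict_proba(X_pool).max(axis=1)   # 2. score the pool
        order = np.argsort(confidence)                         # least confident first
        query = order[confidence[order] < threshold][:budget]  # 3. below threshold
        if query.size == 0:
            break                                # machine is confident everywhere
        y_new = oracle_label(X_pool[query])      # 4. human accepts/changes labels
        X_labeled = np.vstack([X_labeled, X_pool[query]])  # 5. into the next mix
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, query, axis=0)
    return model
```

The max-probability confidence used here is the simplest uncertainty measure; the margin- and entropy-based variants discussed in the survey drop into the same place.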

In the Active Learning community, the human in the process is called an 'oracle' (in mythology, an oracle is a person who could make prophetic predictions, hence the name). And once we have this framework to work with, there are multiple strategies to improve the quality of labeled data, from a human agent actually updating data labels to defining rule sets that improve the sampling of data. For instance, if you are re-building your credit risk model in the post-Covid-19 world, you might want to guide your training data creation to ensure that you factor in specific cohorts (e.g. customers in professions known to be worst affected). Left to itself, your training data sampling process would not be able to factor this in, unless our 'oracle' can use her tacit knowledge to guide the algorithm in the right direction.
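As a sketch of how that guidance could be encoded (the `profession` field, the cohort list and the boost value are all hypothetical): the oracle's rule set becomes an adjustment to the query priority, so records from the focus cohorts jump the uncertainty queue.

```python
# Folding the oracle's tacit knowledge into the query step: records from
# focus cohorts (e.g. worst-affected professions) get reviewed first.
# `profession` and `focus_cohorts` are illustrative names.
import numpy as np

def cohort_aware_query_order(confidence, profession, focus_cohorts, boost=0.2):
    """Return pool indices ordered most-urgent-first: a lower adjusted
    confidence means earlier human review."""
    priority = np.asarray(confidence, dtype=float).copy()
    priority[np.isin(profession, focus_cohorts)] -= boost  # push cohorts forward
    return np.argsort(priority)
```

Swapping this in for the plain `np.argsort(confidence)` in the loop above keeps the uncertainty signal intact while guaranteeing that the cohorts the oracle cares about get human attention first.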
There is a lot going on in this space (link below). And one last thing: there is an inherent potential for contradiction here. Once you insert humans into this process, can human biases be far behind? What if the 'oracle' is not so rational after all, and tries to steer the algorithm to reinforce her individual biases? More on that later.
Some links:
Research on ‘inquisitive’ machine learning: http://active-learning.net/
Side story: There is a whole industry around data labeling. It dates from when the internet was obsessed with identifying dogs and people in photos. I sincerely hope the last few months (and especially the last couple of weeks) will force us to take a step back and re-work our collective priorities. There are bigger problems to solve than getting machines to label dogs and cats. https://www.ft.com/content/56dde36c-aa40-11e9-984c-fac8325aaa04