The best part about building a collection of books over the years is that every once in a while, you chance upon a book that helps you think through some of the current problems you are working on. The other day, I was flipping through ‘Probability, Statistics and Truth’ by Richard von Mises, generally regarded as a classic. There is an interesting discussion on Decisions (Bayesian) vs. Conclusions (deterministic, based on evidence). With Big Data/AI front and center of the modern enterprise, this topic has become very relevant again, although in a different sense.
For the longest time, data-driven decision making was designed to make the process deterministic. This followed a very structured process: start with an end goal in mind, systematically collect input data, and process all the inputs through a series of rules (i.e. an algorithm) to arrive at a recommended solution. Take, for instance, the Purchase Requisition Approval process. The approval authority is typically determined by clearly defined criteria based on, say, the requisition type, the total amount, urgency and so on. This in turn enables such processes to be automated using systems.
We have always known that this deterministic approach is less than perfect. There are two broad problems with this paradigm:
1. We need a clear understanding of the process, and all relevant input signals need to be available before a decision can be taken. This slows the entire decision making process.
2. In real life, the decision making process is far richer than a linear, rule-based system. There are biases, individuals trying to guess each other’s motives and so on. For the most part, we have never known how to model all this complexity.
For the rest of this post, we will focus on #1. #2 deserves a separate discussion.
Why is the deterministic approach failing?
There are two clear pressures: complexity is increasing at a faster rate than ever, and at the same time the pressure to reduce the time to decision keeps growing. These in turn impose two main constraints on the decision making process: an increasing number of inputs and an ever-shrinking time to decision.
Take Data – it is clearly one of the most important assets in any company. And that makes the CDO’s objective of ensuring Data Quality increasingly important. There was a time when data assets were relatively low in volume, velocity and variety, and it was possible to ensure data quality through a combination of processes (e.g. data governance) and business rules. That now seems like a different era altogether – with data volumes growing at a dizzying speed, it is increasingly hard, if not impossible, to fall back on rules-based methods to control data quality at the point of creation.
What can we do about this?
This is where we need to go back to how humans process data – from the time we roamed the jungles of Africa, dodging predators and hunting for game, we have operated on ‘gut’ – which is nothing but taking in lots of information, doing some rather clever pattern matching and making a probabilistic choice of action. That has worked pretty well for us as a species – and there is something to learn from there.
AI offers us that opportunity – to reduce the time to a decision as well as handle exponentially growing complexity by taking a probabilistic approach to decisions. Instead of being constrained by completely deterministic rules, we now have the opportunity to work with incomplete information to make probabilistic recommendations, observing how the recommendations are consumed and using that feedback to further improve them. Much like ‘crossing a shallow river, feeling one stone at a time’.
And so the CDO can now look at the Data Quality problem in a much different, highly scalable way. Here’s a framework:
- Define a ‘digital fingerprint’ for each data record. A fingerprint – like it says – combines as many attributes as possible to create a ‘unique’ signature for each record.
- Cluster the data records based on their fingerprints. If well designed, each record’s fingerprint will be unique, and the distance between any two fingerprints will represent the similarity between the pair. Two measures of data quality could then be:
  - Outliers in each cluster – records that show an aberration
  - Too-close or near-identical fingerprints – possibly duplicate records
- Assign a confidence score based on the two measures of data quality for each record. This then becomes part of the data record’s definition – consumers of data can choose the confidence cut-off.
- And the most important step in AI: getting the system to learn. There needs to be a way for users to accept/reject the data confidence score. And that in turn, should be used by the AI engine to calibrate the clusters – in other words, re-define the fingerprints based on user inputs.
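The first three steps of the framework can be sketched in a few lines of Python. Everything here is illustrative: the `fingerprint` helper assumes the chosen attributes are already numeric (real data would need encoding for categorical or text fields), and the thresholds and penalty factors are placeholders that step four – the learning loop – would calibrate from user feedback.

```python
import numpy as np

def fingerprint(record, fields):
    # Hypothetical fingerprint: a numeric vector built from the chosen
    # attributes. Real records would need encoding for non-numeric fields.
    return np.array([float(record[f]) for f in fields])

def quality_scores(records, fields, dup_thresh=0.01, outlier_z=2.0):
    fps = np.array([fingerprint(r, fields) for r in records])

    # Normalize each attribute to [0, 1] so no single field dominates
    # the distance between fingerprints.
    rng = fps.max(axis=0) - fps.min(axis=0)
    fps = (fps - fps.min(axis=0)) / np.where(rng == 0, 1, rng)

    # Outlier measure: distance from the centroid, as a z-score.
    d = np.linalg.norm(fps - fps.mean(axis=0), axis=1)
    z = (d - d.mean()) / (d.std() or 1.0)

    # Duplicate measure: distance to the nearest other record.
    pair = np.linalg.norm(fps[:, None] - fps[None, :], axis=2)
    np.fill_diagonal(pair, np.inf)
    nearest = pair.min(axis=1)

    # Confidence score: start at 1.0 and penalize both aberrations
    # and likely duplicates (penalty factors are placeholders).
    conf = np.ones(len(records))
    conf[z > outlier_z] *= 0.5        # aberrant record in the cluster
    conf[nearest < dup_thresh] *= 0.5  # possible duplicate record
    return conf
```

Consumers of the data would then apply their own confidence cut-off to these scores, and accept/reject signals on the scores would feed back into re-defining the fingerprints.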
And so it is: AI offers us the opportunity to model the decision making process as it was always meant to be – working with limited information, making intelligent guesses and, most importantly, continuously learning and improving.