My friend is trying to improve his basketball game, and he finds himself at the basketball court in his neighborhood just about every day. He has set himself a target of practicing free throws until he makes 5 shots in a row. Any time the streak is broken, his counter resets and he keeps going until he gets his first 5-in-a-row, at which point he gets to go home. The really interesting question he had when we met over the weekend: What distribution do his attempts follow, and is it possible to predict how many attempts he will need on any given day? For now, let's assume that the probability of a successful throw is fixed and that each throw is an independent event (we know neither assumption holds in real life). It seems like a fairly straightforward question until you start thinking about it: this goes to the heart of what it means to model real-life phenomena.
How do we model?
We routinely create deterministic mental models to explain broad overall behavior and patterns in just about every area. The most pervasive one is the Normal distribution, which, as we know by now, generally works well when we are dealing with large populations (i.e. where we can seek refuge in the 'law of large numbers'). In this case, if we were trying to estimate the probability of 5-in-a-row shots within a certain number of attempts for a given population (say, all the basketball enthusiasts in the neighborhood park), we should be able to define a deterministic probability distribution (typically, normal).
However, when we get to the specific case of my friend, this deterministic, normal-distribution model no longer holds. His process of shooting the ball follows what we call a stochastic process: a sequence of observations, with each observed value being a random variable. And that process is specific to him as an individual.
Formally speaking, a stochastic process is a sequence of random variables whose behavior is non-deterministic: the next state of the environment is partially, but not fully, determined by the previous state. (Feel free to skip this sentence; just know that this is a fundamental characteristic of stochastic processes.)
We can (and do) generally make assumptions about the probability of a single event (in this case, his ability to make a single throw, given his skill level). However, predicting an outcome defined as a series of 5-in-a-row baskets requires us to make observations over time. Needless to say, the more observations we make, the better we can predict the outcome at a later time. By now, you will have figured out that this is cumbersome and just about hopeless to reduce to a generalized closed form.
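To make this concrete, here is a minimal sketch of one such day at the court, written in Python. The 60% success rate is my own assumption for illustration, not my friend's measured skill:

```python
import random

random.seed(7)  # fixed seed so the example is reproducible
p = 0.6         # assumed probability of making any single throw

# One "day at the court": throw until we see 5 makes in a row.
streak, attempts = 0, 0
while streak < 5:
    attempts += 1
    streak = streak + 1 if random.random() < p else 0  # a miss resets the counter

print(f"went home after {attempts} attempts")
```

Run it a few times without the seed and you will see how wildly the attempt count swings from one day to the next; that variability is the stochastic process at work.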
Why should we care?
You are probably wondering why all this deserves a blogpost. While we have always known that much of what happens around us is stochastic, the time, effort and cost of studying phenomena at that level of detail used to be prohibitive, if not impossible. We got around this by making simplifying assumptions at an aggregate level: e.g. biological/biochemical modelling was restricted to the overall organ level as opposed to single-cell analysis; inventory planners were forced to plan safety stocks based on aggregate product-family behavior as opposed to individual SKUs at a store level. The list goes on.
All that is changing, courtesy of modern computing: now you can spin up a cloud instance, fire up compute and storage capacity at will, and get down to modeling and simulating stochastic processes at unprecedented levels of detail. Stochastic modeling at this level of specificity has been around in biological/biochemical research for some time now, and it is about time these methods made their appearance in organizations. Enterprise cloud investments are maturing to a point where it has become feasible to explore these opportunities in earnest. For instance, supply chain teams at retailers can study micro-behavior at a product/store level by running simulations and use the results to determine inventory actions; customer service teams at technology firms can create 'digital twins' of their customer installations, feed in usage metrics, log files and so on to simulate product behavior, predict outages at the individual level, and proactively determine targeted service interventions.
Back to my friend's question: What distribution do his attempts follow, and is it possible to predict the number of attempts on any given day? We now know that his attempts do not follow a neat, familiar distribution like the normal, so no formula will hand us the exact number of attempts for a given day. What we can do is run simulations and estimate the average number of attempts (with a standard deviation) he will need to meet his goal on any given day. It took me less than 50 lines of Python code and less than 30 seconds of compute time to run a 1,000-iteration simulation model on a free cloud account.
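For the curious, here is roughly what that simulation could look like. This is a sketch under the same assumed 60% success rate, not my friend's actual code or numbers:

```python
import random
import statistics

P_SUCCESS = 0.6  # assumed single-throw probability; the real number is unknown
TARGET = 5       # the streak length he is chasing
DAYS = 1000      # simulated sessions, matching the 1000-iteration run

def one_session():
    """Throw until TARGET consecutive makes; return total attempts."""
    streak = attempts = 0
    while streak < TARGET:
        attempts += 1
        streak = streak + 1 if random.random() < P_SUCCESS else 0
    return attempts

results = [one_session() for _ in range(DAYS)]
print(f"mean attempts over {DAYS} days: {statistics.mean(results):.1f}")
print(f"standard deviation:             {statistics.stdev(results):.1f}")
```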
The possibilities are exciting and we are just getting started.
PS: If you were to look for a father of stochasticity (yes, it is a word!), it would be Markov (see an earlier post). One thing led to another, and in the 1940s Claude Shannon gave the world Information Theory as a formal means to study stochastic processes. If you want to know more, do take a look at this video that uses Wordle (these days, everything seems to connect back to this game!) as an example to explain Information Theory. While you are at it, you would do well to subscribe to Grant Sanderson's channel: www.3blue1brown.com