Over the years, I have seen some projects do well, some fail spectacularly, and many more fizzle out. Here are a few observations on behaviors and activities that seem to correlate with generally good outcomes for projects. As always, this list is neither exhaustive nor prescriptive.
To begin with, I have played chess for many years now and there are striking parallels:
- When you start a game, you have no idea what the eventual outcome is going to be – but that never stops you from starting with an overall strategy. Let’s call it the ‘ultimate’ strategy – knowing full well that it is dynamic and needs to evolve
- At any stage, it is almost impossible to figure out the perfect move (computational overheads just won’t let you). Given that, you focus on the next best move – one that fits the current (‘proximate’) strategy and does not conflict with the ‘ultimate’ strategy
- And after each move, you need to learn by using the feedback loop (i.e. the opponent’s move and the state of the game) to re-evaluate the ‘ultimate’ strategy and adjust the ‘proximate’ strategy if required.
And with that, here are my 5 nuggets:
- Front-load design: early on, identify the specific areas where you need to spend time. This is more of an art than a science and can determine the course a project takes. The biggest time and effort sink early on is typically data understanding – this is almost a truism, yet it happens over and over again. Teams get sucked in and, before you know it, weeks have flown by. What helps is to make sure that someone on the team is thinking about the equally important questions – both broad (e.g. what is the consumption strategy; does the model need to be deployed in an IT system?) and deep (e.g. what should be the dependent variable in the model; at what level of granularity does the model need to be built?)
- Identify ‘the’ modeling strategy: AI/ML is obviously the flavor of the month, and this seems to pressure every data scientist into chasing the next fancy ML technique. Meanwhile, perhaps the best-kept secret in Analytics is that Regression (Linear and Logistic) gets you started and often provides good enough answers to begin engaging with business stakeholders. Too often, the means are confused with the ends. It is useful to remember two simple guidelines:
- Explore multiple modeling techniques – always default to a champion/challenger setup. This requires putting in place a rudimentary process to evaluate models and, in some cases, an automated process for model selection as well. Well worth the trouble.
- Start with an initial portfolio of models, then improve the existing set with feature additions and feature transformations until you converge on an acceptable solution. It is important to drive convergence with the business stakeholders – more often than not, the business rarely, if ever, needs the most accurate answer. The same applies to most decisions business stakeholders make – which is why speed and an effective feedback loop are hallmarks of the best organizations.
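The champion/challenger idea can be sketched in a few lines. A minimal sketch, assuming scikit-learn is available – the dataset is synthetic and the candidate models are illustrative, not a recommendation:

```python
# Champion/challenger sketch: a regression baseline vs. a fancier challenger,
# evaluated on the same folds with the same metric. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),        # the baseline
    "gradient_boosting": GradientBoostingClassifier(random_state=42),  # challenger
}

# Rudimentary, automatable model selection: score every candidate, pick the best.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
champion = max(scores, key=scores.get)
print(f"champion: {champion}, AUC by model: {scores}")
```

The point is less the specific models than the habit: every new technique enters as a challenger against the current champion, under one agreed metric.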
- Define success criteria: Too often, projects lack a clear definition of success parameters – starting with how to measure model accuracy in purely statistical terms. Then comes the ‘field testing’ – from the design of experiments (e.g. A/B tests) to user validation. And finally, the operationalization process, which requires guardrails. Say you have built a sales opportunity propensity scoring model. Once you have established the model accuracy, you will need to plan a controlled experiment – publish the propensity scores to a subset of sellers and observe the incremental lift, if any, from the improved prioritization. You will need to prove that the model improves the overall conversion process across the funnel before the scoring model is integrated into the Sales systems and the scores are truly used to drive business decisions. As you might imagine, the post-model-development phases are the most arduous, and yet the most important for the success of the project.
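The lift check in that experiment is simple arithmetic plus a significance guardrail. A minimal sketch, with a two-proportion z-test – the seller and win counts below are invented purely for illustration:

```python
# A/B sketch for the propensity-score rollout: did sellers who saw the scores
# convert better? All counts are made up for illustration.
import math

control = {"sellers": 5000, "wins": 400}    # scores not published
treatment = {"sellers": 5000, "wins": 470}  # scores published

p_c = control["wins"] / control["sellers"]
p_t = treatment["wins"] / treatment["sellers"]
lift = (p_t - p_c) / p_c  # incremental lift from the improved prioritization

# Two-proportion z-test as a guardrail before wiring scores into Sales systems.
p_pool = (control["wins"] + treatment["wins"]) / (control["sellers"] + treatment["sellers"])
se = math.sqrt(p_pool * (1 - p_pool) * (1 / control["sellers"] + 1 / treatment["sellers"]))
z = (p_t - p_c) / se

print(f"lift: {lift:.1%}, z: {z:.2f}")
```

Only when the lift is both material to the business and statistically credible does integration into the Sales systems make sense.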
- Data will be sparse – deal with it: In certain situations, big data can be a myth. Start with this premise and then move forward, especially when it comes to feature engineering, target variable identification and, in general, the modeling strategy. A typical example is making predictions at the lowest level of granularity – e.g. forecasting at a SKU/country/week level usually ends up being a sparse data problem. It comes down to the same question: what is the problem you are trying to solve, and what is the best method to solve it? Data (big or small) is an enabler.
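A quick way to confront this premise early is to measure how sparse the modeling grain actually is. A minimal sketch with pandas, using a tiny made-up sales history at the SKU/country/week level:

```python
# Sparsity check at the SKU/country/week grain. The sales history here is
# tiny and invented; in real data the populated fraction is often small.
from itertools import product
import pandas as pd

sales = pd.DataFrame({
    "sku":     ["A", "A", "B", "C"],
    "country": ["US", "DE", "US", "FR"],
    "week":    [1, 2, 1, 3],
    "units":   [10, 5, 7, 3],
})

# Full grid of every SKU x country x week combination the model would score.
grid = pd.DataFrame(
    list(product(sales["sku"].unique(),
                 sales["country"].unique(),
                 sales["week"].unique())),
    columns=["sku", "country", "week"],
)
observed = sales[["sku", "country", "week"]].drop_duplicates()
fill_rate = len(observed) / len(grid)
print(f"{len(observed)} of {len(grid)} cells populated ({fill_rate:.0%})")
```

A low fill rate is an early signal to rethink the granularity, the target variable, or the modeling strategy – before weeks are sunk into a model the data cannot support.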
- Exploit the power of Exploratory Data Analysis (EDA) and insights: EDA usually throws up a lot of interesting insights. It should start with basic univariate analysis and then quickly progress to hypothesis testing. It is also a great opportunity to engage with stakeholders and establish a feedback loop early on. And perhaps most importantly, it gets the Data Science team out of the ivory tower and helps it engage better with the problem space.
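That progression – univariate summaries first, then a concrete hypothesis to put in front of stakeholders – can be sketched briefly. Assuming NumPy, pandas and SciPy; the deal-size data and the ‘Enterprise deals are larger’ hypothesis are invented for illustration:

```python
# EDA sketch: univariate summary first, then a quick hypothesis test.
# The data is synthetic; in practice this is where stakeholder talks start.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": rng.choice(["SMB", "Enterprise"], size=500),
    "deal_size": rng.lognormal(mean=3.0, sigma=0.5, size=500),
})

# Univariate first: distribution, spread, outliers.
print(df["deal_size"].describe())

# Then a concrete, discussable hypothesis: "Enterprise deals are larger."
smb = df.loc[df["segment"] == "SMB", "deal_size"]
ent = df.loc[df["segment"] == "Enterprise", "deal_size"]
t_stat, p_value = stats.ttest_ind(ent, smb, equal_var=False)  # Welch's t-test
print(f"t={t_stat:.2f}, p={p_value:.3f}")
```

Even a throwaway test like this turns a vague observation into something stakeholders can confirm, refute, or redirect – which is exactly the feedback loop you want early on.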
Like I said, this list is by no means complete – but it is a good starting point nevertheless.