Demand Forecasting is a widespread application for Machine Learning. Almost every enterprise, across various sectors, keeps investing in this area. None more so than industries with physical Supply Chains, where it is estimated that a 10-20% improvement in forecast accuracy can translate to a 5% reduction in inventory costs and a 2-3% increase in revenues (McKinsey).

For decades, we have relied on supervised time series methods (ARIMA, Prophet et al), but as practitioners will attest, these require extensive data preprocessing, feature engineering, and constant monitoring and adjustment of the forecasting algorithms. And of late, Demand Sensing has added another layer of complexity – especially since many of the near real-time signals come from diverse sources, both structured (e.g. POS transactions, weather data) and unstructured (e.g. social media feeds, customer reviews), that tend to be noisy (susceptible to external factors) and confounding (e.g. weather events might drive POS transaction patterns). Given all this, data scientists are always on the lookout for better approaches to get that extra bit of accuracy. With the emergence of LLMs, there are two interesting approaches worth considering.
Approach 1: Deep Learning models
On last weekend’s long run, I listened to a fascinating episode of the Mindscape podcast (highly recommended) featuring computer scientist Tina Eliassi-Rad, where she talked about one of her research papers on predicting human life events: “Drawing on a unique dataset consisting of detailed individual-level day-by-day records, describing the 6 million inhabitants of Denmark, spanning a 10-year interval, we show that accurate individual predictions are indeed possible.” What intrigued me was the approach: “We use transformer models to form compact representations of individual lives. We call our deep learning model life2vec. The life2vec model is based on a transformer architecture. Transformers are well suited for representing life-sequences due to their ability to compress contextual information and take into account temporal and positional information.” Here’s the research paper in full (worth reading). It must be noted that all this was possible given the richness of the dataset. The two primary building blocks of this approach:
- Symbolic representation of the underlying data: The research team developed a formal vocabulary – in other words, they converted each category of discrete features (e.g. occupation) and discretized continuous features (e.g. number of years in a job) into elements of formal sentences. For instance, an event is transposed into a sentence like: “In September 2020, Francisco received twenty thousand Danish kroner as a guard at a castle in Elsinore”. Put together, these sentences form a life journey for an individual. Needless to say, this vocabulary is domain specific and developed by human experts (a rough sketch of the idea follows this list).
- A deep learning model trained on this corpus of life events to predict the next token in the sequence (i.e. the next stage in the individual’s life journey). The model performed markedly better than mainstream state-of-the-art models trained on the same data across a wide spectrum of predictions – from the occurrence of a discrete event (e.g. death prediction) to personality nuances.
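To make the symbolic-representation idea concrete, here is a minimal sketch of what the encoding step might look like. The `LifeEvent` fields, the sentence template, and the helper names are hypothetical stand-ins of my own; the actual life2vec vocabulary was designed by domain experts over a far richer national registry.

```python
from dataclasses import dataclass

# Hypothetical event record; the real life2vec vocabulary is far richer
# and defined by domain experts, not this toy schema.
@dataclass
class LifeEvent:
    when: str            # e.g. "September 2020"
    person: str
    amount_bucket: str   # a continuous value discretized into a labeled bucket
    role: str
    location: str

def event_to_sentence(e: LifeEvent) -> str:
    """Transpose one structured event into a formal sentence."""
    return (f"In {e.when}, {e.person} received {e.amount_bucket} "
            f"as a {e.role} in {e.location}.")

def life_journey(events) -> str:
    """Join chronologically ordered events into one training document."""
    return " ".join(event_to_sentence(e) for e in events)

event = LifeEvent("September 2020", "Francisco",
                  "twenty thousand Danish kroner", "guard at a castle", "Elsinore")
print(event_to_sentence(event))
# In September 2020, Francisco received twenty thousand Danish kroner
# as a guard at a castle in Elsinore.
```

A corpus of such life journeys is then tokenized and used for next-token training, exactly as one would train a language model on ordinary text.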
Approach 2: Training Transformer models with time-series data
A team from Amazon Science trained a transformer model (in this case, T5, one of the early LLMs, going all the way back to 2019 – sounds ancient in these breathtaking times!) on a corpus of publicly available time series datasets (augmented by synthetic datasets), which were subject to just two treatments:
- Scaling the time series values, which most will immediately recognize as a standard data transformation for forecasting. In this case, they used mean scaling, which normalizes the individual entries of the time series by the mean of the absolute values in the historical context.
- Chunking: This was necessary because time-series data is a sequence of real values and cannot be processed directly by language models. To get around this, they used a simple but clever method – bucketing the data points into discrete bins (i.e. tokens), each with a labeled ‘bin center’, which then formed a sequence used to train the model to predict the next token. (A rough sketch of both treatments follows.)
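Here is a rough numpy sketch of those two treatments. The bin count, bin range, and function names are illustrative assumptions on my part; the paper’s actual tokenizer differs in its specifics (e.g. special tokens and the exact quantization scheme).

```python
import numpy as np

def mean_scale(context):
    """Normalize entries by the mean of absolute values in the historical context."""
    scale = np.abs(context).mean()
    scale = scale if scale > 0 else 1.0   # guard against an all-zero series
    return context / scale, scale

def tokenize(scaled, n_bins=100, lo=-5.0, hi=5.0):
    """Bucket scaled values into discrete bins (tokens). Bin centers let a
    predicted token be mapped back to a real value at inference time."""
    edges = np.linspace(lo, hi, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, centers

series = np.array([102.0, 98.0, 110.0, 95.0, 107.0])
scaled, scale = mean_scale(series)
tokens, centers = tokenize(scaled)
# The token sequence trains a next-token model; a predicted token id t
# de-tokenizes back to a forecast value via centers[t] * scale.
```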
Interestingly, these pretrained time series models (called Chronos) also outperformed several popular time-series models (e.g. AutoARIMA), and as the paper says, “Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines”. See the paper for detailed benchmarking results.
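If you want to try this yourself, the team open-sourced the models. Below is a sketch based on my reading of the chronos-forecasting package’s README; treat the exact arguments (checkpoint name, predict signature, output shape) as assumptions and check the repo for current usage.

```python
# pip install chronos-forecasting torch
import torch
from chronos import ChronosPipeline

# Load a pretrained checkpoint; "amazon/chronos-t5-small" is one of the
# published model sizes on the Hugging Face hub.
pipeline = ChronosPipeline.from_pretrained("amazon/chronos-t5-small")

# Any 1-D history works: no feature engineering, no per-series fitting.
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])
forecast = pipeline.predict(context, prediction_length=4)

# predict() returns sample paths; summarize them into point and
# interval forecasts (shape assumed: [series, samples, horizon]).
low, median, high = torch.quantile(
    forecast[0].float(), torch.tensor([0.1, 0.5, 0.9]), dim=0
)
print(median)
```

The zero-shot part is the point: the same pretrained model produces a usable probabilistic forecast for a series it has never seen.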
So what?
All this looks like pretty cool research (and it is), but how do we apply it? What sent me down this path was a question: how is a language model that predicts the next token different from a time series forecasting model that predicts the next value? On the surface, LLMs generate tokens from a finite vocabulary, while forecasting models generate predicted values from an unbounded, usually continuous domain – but then again, both are trying to learn the sequential structure of data to predict future patterns.

Which leads me to the hypothesis: if there is a way to train language models on the specific “language of time series” (with help from a formal vocabulary that captures the relevant context) and get them to be efficient and effective, we could scale out time series predictions without the expensive data preparation that currently consumes a significant portion of forecasting systems. That, in turn, could enable us to truly expand the data sources, improve the granularity, refresh forecasts based on near real-time signals, and so on. Compare that to the existing paradigm, where forecasting runs are relatively coarse-grained (a trade-off with accuracy) and the frequency of runs is constrained by compute capacity. These two together inhibit the ability to refresh forecasts based on near real-time signals, which in turn forces data scientists to apply a separate set of Demand Sensing algorithms to adjust the baseline forecasts based on real-time information.

This definitely needs some more work, but I am pretty sure that as we improve our understanding of deep neural net models, the world of forecasting is going to see a transformation. Exciting times ahead!