#69: Small and wide data

Those of us who have spent enough years in Data and Analytics know that this industry, more than most in tech, is rife with jargon and, every so often, the irresistible urge to spin some new buzzword. (To some, this is how the industry keeps the hype going, and they wouldn’t be wrong. But that is a topic for another day.) So when Gartner called out a key trend in 2021 – ‘From big to small and wide data’ – many in the trade felt a sense of déjà vu, and rightfully so. For those of us in the B2B world, this has more or less been the state of Analytics for several years now. As someone who has been doing B2B Analytics for a living for many years, here are some of my notes from the field.

What makes data ‘small and wide’?

Let’s begin at the beginning – because that is always a good place to start: What defines ‘small and wide’ data?

  1. Limited generation of data: A fundamental, although usually overlooked, constraint: there are often just not enough sources of data. For those of you from the B2C world, this is difficult to grasp: after all, all you need to do is release a feature in a well-defined market (pick a country – say Australia!), sit back and capture consumer behavioral metrics. A typical B2B firm, on the other hand, is working with a client base that is small to begin with (especially in the Enterprise segment). And since those clients are likely paying hefty amounts, they will want to put you through many hoops before even letting you capture any data. Think of small data as a design feature.
  2. Diverse signals: A typical B2B enterprise touches the customer from multiple functions: Marketing (Customer engagement); Sales (Contracts, Cross-sell/up-sell, Renewals); Support (Tech Support) and, of late, Product Engineering (PFRs – product feature requests, co-operation on product roadmaps etc.). All this means multiple engagement touchpoints, and that creates wide data. However, there are often two constraints:
    1. The systems supporting these interactions tend to be siloed, and stitching these diverse signals together into a coherent customer journey map is easier said than done.
    2. Data is no longer monolithic – it starts to cover the entire gamut, from structured (Sales Orders, Support tickets) to unstructured (Sales Contracts, Case logs). In fact, unstructured data tends to be richer – the question then shifts to the ability to extract signals from all the noisy data.
  3. Narrow, repetitive data: The good thing about B2B customers is that they tend to be sticky. Once an application makes it through into a production environment, chances are it will be around for some time. That allows observations to be captured over lengthy periods of time. In other words, small and wide data often comes with narrow and deep data.
  4. Tacit knowledge: Every B2B firm has employees who have “spent many, many years in the job and seen it all”. There is a tremendous amount of tribal knowledge and experience that is often difficult to capture as data. It extends well into the long tail and is not captured in a structured fashion, which in turn makes it difficult to train ML models on it.
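
To make ‘small and wide’ a little more concrete, here is a minimal Python sketch of stitching per-function records into one wide row per customer. Everything in it is hypothetical and invented for illustration: the function name `build_wide_view`, the source names and every field. It also assumes each silo already exposes one summary record per customer under a shared `customer_id`, which in practice is often the hard part.

```python
from collections import defaultdict


def build_wide_view(sources):
    """Merge per-function customer records into one wide row per customer.

    `sources` maps a (hypothetical) function name such as "sales" to a list
    of dicts, each carrying a shared "customer_id" key. Assumes at most one
    summary record per customer per source; a later record would overwrite
    an earlier one.
    """
    wide = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            customer = record["customer_id"]
            for field, value in record.items():
                if field != "customer_id":
                    # Prefix each column with its source to avoid name clashes.
                    wide[customer][f"{source_name}.{field}"] = value
    return dict(wide)


# Illustrative, made-up records from three siloed systems.
sources = {
    "marketing": [{"customer_id": "acme", "webinars_attended": 2}],
    "sales": [
        {"customer_id": "acme", "contract_value": 250000, "renewal_due": "2022-03"},
        {"customer_id": "globex", "contract_value": 90000, "renewal_due": "2021-11"},
    ],
    "support": [
        {"customer_id": "acme", "open_tickets": 3},
        {"customer_id": "globex", "open_tickets": 1},
    ],
}

wide = build_wide_view(sources)
for customer, row in wide.items():
    print(customer, len(row), "columns:", sorted(row))
```

Even with only two customers, each row is already wider than the table is long, and a customer with no record in one system simply ends up with fewer columns rather than an error.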


All this creates a set of challenges when it comes to delivering business value from ‘small and wide’ data in a scalable, repeatable manner. While the list below is certainly not exhaustive, it is a good start, and I suspect that most of you will have seen these in one form or another (I cover a few design tenets that might help further down):

  1. Signal to noise ratio: With unstructured data (think customer call transcripts, server logs) comes a lot of noise. Extracting meaningful signals from that noise is probably the single biggest challenge. It is often compounded by how few datasets can fully capture the customer context. The result is either a cold-start problem or poor model accuracy, and hence weak predictive capability, for ML teams trying to train and deploy inference models to automate some of the decision-making processes.
  2. Data in silos: Perhaps even more than noisy data, data stuck in different databases continues to be an impediment, especially in the many firms that have grown their data and technology stacks organically over the years. Most companies have come to accept that this problem cannot be solved by throwing (lots of) money at it and chasing the dream of a unified data warehouse (how many times have we heard of white elephant projects titled X360 – take your pick for X = Customer, Product et al.). This much is clear – the paradigms of the centralized warehouse, or even the next iteration, the big data ecosystem with a data lake, have largely overpromised and underdelivered.
  3. Capturing tacit knowledge: It is generally true that the most experienced people in any team (Sales, Product, Customer Care etc.) are not the most eager to embrace digital transformation. The challenge here is to figure out mechanisms to capture, in learning systems, the nuances that have been built up over decades of experience.

This list is by no means exhaustive. However, most challenges fall under these three broad areas, and it stands to reason that any Data and Analytics strategy must address them.

Design Tenets

A quick aside. Why call them Tenets? Because these are the essential ideas or principles that should guide solution design and implementation.

  1. Go beyond the organization and embrace open-source ML: There has been an explosion of open-source activity in the ML world, with multiple organizations (AWS, Microsoft, Google et al.) releasing models openly. These include models trained on internal datasets – and what makes this exciting is that others can now use these models directly for inference and get around the training problem. Moreover, these models are increasingly becoming domain and industry specific. One example is the Hugging Face model hub, which offers pretrained NLP models ready to deploy for inference. Integration with a cloud ML service like Amazon SageMaker allows for easy deployment and operationalization. All this means that there is a potential way out of the problem of limited, noisy data that hampers the adoption of ML-based decision automation.
  2. Embrace the Distributed Data Mesh architecture: Small data by itself often has limited value. One way to get around the small ‘data ponds’ in functional silos is to re-imagine the data ecosystem. The Data Mesh is turning out to be more than an emerging architecture pattern to share, access and manage data across multiple environments – the main idea here is to develop a data governance model that helps evolve the Data Lake into a Data Marketplace. There is a lot of interesting work going on in this space and, if pulled off, it could provide a way to meaningfully create and deliver value from data at an enterprise level.
  3. Understand pockets of depth in small data: Anyone who has worked with enterprise data has seen narrow slivers of data that are ‘deep’. Think of timeseries data for metrics that are captured with metronomic frequency. Observability is becoming increasingly important – and not just in specific niches like log analysis, but also with operational and business KPIs. Armed with capabilities like Anomaly Detection and Automated Root Cause Analysis, there is an emerging opportunity to capture early warning indicators and let ML and analytical tools do automatic triaging to identify actionable areas. Data may be small in scope, but deep and rich in its ability to power decisions.
  4. Experimentation in the Enterprise: Experiments have become de rigueur in the B2C toolkit. Surprisingly, not so in the Enterprise context. That could change – especially when it comes to overcoming the ‘data inertia’ that continues to be the barrier to capturing the pockets of tacit knowledge. One potential way to win the skeptics over is to run controlled experiments to prove the advantage of data-driven recommended actions over gut-based, purely experiential actions. Think of a Next Best Action engine tested with Sales cohorts, both in Strategic (multiple touch-points with different stakeholders within a client) and Enterprise (touch-points with stakeholders across different clients). When designed and executed well, experiments offer a way to methodically build a corpus of intelligence to augment the otherwise small data.
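
When the cohorts in such an experiment are tiny, one way to read the result is an exact permutation test, which makes no large-sample assumptions. The sketch below is only an illustration, not a prescribed method: the function name, the metric (deals closed per rep in a quarter) and all the numbers are made up, and it assumes reps were randomly assigned to the ‘recommended action’ and ‘gut-based’ cohorts.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_test(treated, control):
    """One-sided exact permutation test on the difference in means.

    Returns (observed_difference, p_value), where p_value is the share of
    all possible relabelings of the pooled data whose difference in means
    is at least as large as the observed one.
    """
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    n_control = len(control)
    total = sum(pooled)
    observed = mean(treated) - mean(control)

    at_least_as_large = 0
    relabelings = 0
    # Enumerate every way of choosing which observations count as "treated".
    for chosen in combinations(range(len(pooled)), n_treated):
        treated_sum = sum(pooled[i] for i in chosen)
        diff = treated_sum / n_treated - (total - treated_sum) / n_control
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
        relabelings += 1
    return observed, at_least_as_large / relabelings


# Hypothetical deals closed per rep in one quarter (illustrative numbers only).
recommended_action_cohort = [7, 9, 8, 10, 9, 8]
gut_based_cohort = [5, 6, 7, 5, 6, 4]

difference, p_value = exact_permutation_test(
    recommended_action_cohort, gut_based_cohort
)
print(f"difference in means: {difference:.2f}, one-sided p-value: {p_value:.4f}")
```

Full enumeration is only practical for small cohorts, since the number of relabelings grows combinatorially; for larger ones, sampling random relabelings is the usual substitute.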
