#80: Measuring Human and AI Agents

Since 2023, we have been hearing doomsayers talk about a coming AI Apocalypse that will take our jobs, run our economies, and so on. They keep pointing to the impressive performance gains and reasoning capabilities of the frontier models. Meanwhile, those of us working in Enterprise AI continue to be frustrated by the temptation to treat these benchmark-driven headlines as a direct proxy for business value. Anyone who has struggled to deploy an AI customer chatbot in production knows that we are nowhere close to this threatened armageddon. To be sure, we are beginning to see some signals, particularly that the unemployment rate for 22-27 year olds is higher than the national average (5.8% vs. 4.2%, the first time in 45 years), but what this means is a topic for another day.

What I want to cover in this post is a basic hypothesis: it is reductive to frame human vs. algorithmic performance as an either-or debate. In reality, every business process is a collection of tasks that follows a power-law distribution: a high frequency of low-complexity tasks and a ‘fat tail’ of complex ones. We have a long history of automating the low-complexity tasks (think productivity tools, ERP systems, dashboards, and analytical tools) and thereby creating capacity for humans to deal with the complex edge cases, which typically require human judgement and experiential knowledge.

Given that, the question that we need to be thinking about is: ‘What is the optimal way to distribute tasks between human and AI agents for a given business process?’ 

Sounds obvious, but it raises a further question: what would a standardized framework for measuring the task-level performance of human and AI agents look like?

Lessons from the field

The best part about my work is that I get to work with customers who are looking to deploy Agentic AI systems with all the messy real-world challenges of data bottlenecks, operational variance, and the like. Where we see success, the scope of the Agentic AI is tightly defined and broadly meets the following criteria:

  1. The agent should be able to successfully finish a task most of the time without any human intervention (the exact threshold is a function of the criticality of the application: a customer support agent troubleshooting a product can have a lower cut-off than an agent resolving billing issues).
  2. When the agent fails to meet the cut-off (i.e. fails to complete the task), the failures are bounded, auditable, and reversible.
  3. The failures are identified in the workflow fast enough that they can be routed to a human agent in an elegant manner (a minimal sketch of this routing follows the list).
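
To make the first and third criteria concrete, here is a minimal sketch of a completion threshold and human hand-off. The task types, threshold values, and function names are illustrative assumptions, not taken from any specific deployment:

```python
from dataclasses import dataclass

# Illustrative cut-offs: a troubleshooting agent can tolerate a lower
# completion threshold than an agent resolving billing issues.
COMPLETION_THRESHOLDS = {
    "product_troubleshooting": 0.80,
    "billing_resolution": 0.95,
}

@dataclass
class AgentResult:
    task_type: str
    confidence: float  # self-reported or externally scored completion confidence
    output: str

def route(result: AgentResult) -> str:
    """Auto-complete when the agent clears the cut-off for its task type;
    otherwise fail fast and hand the task to a human agent."""
    threshold = COMPLETION_THRESHOLDS.get(result.task_type, 0.99)
    return "auto_complete" if result.confidence >= threshold else "route_to_human"
```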

I have found three key design guidelines that appear to make this work (a sketch of how they come together follows the list):

  1. A tight scope definition: one that fits explicitly into the context window or a tightly bounded retrieval layer, and that prevents the agent from running amok.
  2. Continuous evaluation: clearly defined, instrumented metrics and a monitoring process with deterministic circuit-breakers built into the workflow.
  3. Well-defined orchestration: clear definitions of which functions to call, when to loop, and how to back off when metric thresholds or other limits (e.g. rate limits) are crossed.
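
Here is a minimal sketch of what these guidelines can look like when wired into an orchestration loop. The callables (call_model, evaluate_step, escalate_to_human), the step and retry limits, and the dict-shaped result are all assumptions made for illustration:

```python
import time

MAX_STEPS = 8        # deterministic circuit-breaker on loop length
MAX_RETRIES = 3      # back-off budget for rate limits and transient errors
FAILURE_BUDGET = 2   # hand off once this many step-level evaluations fail

def run_agent(task, call_model, evaluate_step, escalate_to_human):
    """Orchestrate a tightly scoped agent: bounded steps, a continuously
    evaluated per-step metric, and exponential back-off on transient errors."""
    failures = 0
    for step in range(MAX_STEPS):
        for attempt in range(MAX_RETRIES):
            try:
                result = call_model(task)   # assumed to return a dict
                break
            except TimeoutError:
                time.sleep(2 ** attempt)    # back off, e.g. on rate limits
        else:
            return escalate_to_human(task, reason="retries_exhausted")

        if not evaluate_step(result):       # instrumented step-level check
            failures += 1
            if failures > FAILURE_BUDGET:   # deterministic circuit-breaker
                return escalate_to_human(task, reason="failure_budget_exceeded")
        elif result.get("done"):
            return result                   # task completed within scope

    return escalate_to_human(task, reason="step_limit_reached")
```

The point is not the specific limits, but that every exit path is deterministic and ends either in a completed task or an explicit hand-off to a human.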

System optimization: Conceptual definition

Broadly speaking, we measure AI systems along two dimensions (simplistic definitions):

  1. How accurate is the AI system? This is usually measured using a loss function, which captures the deviation of the model's output from the ground truth (a simplistic illustration follows this list).
  2. How well is the AI system learning? This is usually measured by how much the model improves with feedback (generally speaking, RLHF augments the training objective with a reward function learned from human feedback).
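
As a deliberately simplistic illustration of the first dimension, a task-level loss can be as crude as the fraction of outputs that deviate from the ground truth; the function below is a placeholder, and real systems would use a task-appropriate metric:

```python
def task_loss(outputs, ground_truth):
    """0/1 loss: fraction of tasks where the output deviates from the ground truth."""
    misses = sum(1 for x, x_true in zip(outputs, ground_truth) if x != x_true)
    return misses / len(outputs)

# e.g. task_loss(["refund approved", "escalate"], ["refund approved", "close ticket"]) -> 0.5
```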

Can we extend this framework to human agents as well, to create a standardized framework that can help us make system-level decisions? This does sound a bit 1984-esque, but I think it is important to address if we are to be intentional about striking the right balance with AI.

For instance, a holistic optimization function can look something like this (a code sketch follows the definitions):

             U: the total set of tasks
             H: the tasks performed by humans
             M: the tasks performed by machines (M = U - H)
             Loss for humans: ∑Lh(x,x’) over H
             Loss for machines: ∑Lm(x,x’) over M
             where x is the output and x’ is the ground truth.
             Optimization: choose the allocation (H, M) that minimizes ∑Lh(x,x’) over H + ∑Lm(x,x’) over M
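
In code, a minimal sketch of that optimization might treat it as a per-task allocation decision. This assumes independent tasks and the ability to estimate the per-task losses Lh and Lm up front, which is a strong simplification:

```python
def allocate_tasks(tasks, human_loss, machine_loss):
    """Greedy allocation: send each task to whichever agent has the lower
    estimated loss, minimizing sum(Lh over H) + sum(Lm over M).
    human_loss and machine_loss are callables estimating Lh and Lm per task."""
    H, M = [], []
    for task in tasks:
        (H if human_loss(task) <= machine_loss(task) else M).append(task)
    total = sum(human_loss(t) for t in H) + sum(machine_loss(t) for t in M)
    return H, M, total
```

Because the tasks are assumed to be independent, deciding each one separately minimizes the total loss exactly; in practice, the hard part is estimating Lh and Lm at all.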

Applying these kinds of frameworks to measure human tasks has an obvious challenge: it puts us on the slippery slope of reducing human work to a series of tasks and strips away the sense of personal agency that we as humans bring to our work every day. Moreover, it is simplistic; for instance, it needs to factor in very different cost functions (e.g. the cost of an error by an AI agent vs. a human agent) and the element of human judgment (e.g. tasks with a fiduciary or legal impact will need a human-in-the-loop).

At the same time, it is important to build a structured model for task distribution that helps drive towards an optimal allocation of tasks. It is generally accepted that Enterprise AI is nowhere close to being a zero-sum game. Some tasks will be automated by AI, but the truly successful AI implementations, the ones driving sustained business value, will be those that get the human-AI balance right. And more likely than not, this will be a dynamic, evolving system driven by heuristics (trial and error) and empirical evidence (what actually works).

So, what’s next?

Agentic systems will, more than ever, need Domain SMEs and AI Engineers who are able to design and iterate on prompts, vector databases, and functions, and put them all together with an orchestration layer. To do that effectively, it is critical to develop a formal notion of Quality in all AI projects, built on three building blocks (a sketch of how they fit together follows the list):

  1. A robust agent evaluation system that can be used to build an iterative evaluation process based on well-understood metrics.
  2. An experimentation framework that can be used to compare multiple AI system design iterations.
  3. A production monitoring system that extends the evaluation metrics into an ongoing process in the production environment.
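
A minimal sketch of how the three blocks can share a single metric definition, so that offline evaluation, experimentation, and production monitoring all measure the same notion of quality. The function and field names are illustrative assumptions:

```python
from statistics import mean

def evaluate(agent, eval_set, metric):
    """Block 1: score an agent on a labelled evaluation set with a shared metric."""
    return mean(metric(agent(ex["input"]), ex["expected"]) for ex in eval_set)

def compare_variants(variants, eval_set, metric):
    """Block 2: run the same evaluation across design iterations
    (different prompts, retrieval settings, orchestration choices)."""
    return {name: evaluate(agent, eval_set, metric) for name, agent in variants.items()}

def monitor(agent, live_traffic, proxy_metric, alert_threshold, alert):
    """Block 3: extend the same notion of quality to production, scoring live
    outputs with a proxy metric (e.g. an automated judge) and alerting on drift."""
    for ex in live_traffic:
        output = agent(ex["input"])
        score = proxy_metric(ex["input"], output)
        if score < alert_threshold:
            alert(ex, output, score)
```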

Clearly, we are just getting started; there is a lot more to do in defining these fundamental frameworks that take a holistic view of human and AI agents.
