I have always enjoyed Thanksgiving family get-togethers because, among other things, they give me a chance to catch up with what the next generation is up to; I always learn a thing or two from them. This time, I enjoyed talking to my niece’s husband, who thinks very deeply about data privacy (he is working towards a PhD in that area), and we got talking about the evolving nature of privacy and trust with AI. While he comes at it from a policy framework lens, it struck me that: 1/ Enterprise customers are thinking about the same issues, even if from the specific lens of measuring risk and liability. 2/ These issues have been around for a while with ML systems, but with AI they feel different, and understandably so, especially as we expand the scope of automated decision making. For instance, ML algorithms like credit card fraud detection have gained widespread adoption partly because monitoring has gotten extremely good through a series of incremental improvements over the years. AI seems to have arrived all of a sudden, which is why most companies are scrambling to find bottom-up technical solutions even as regulatory and compliance policies evolve. This goes beyond the reductive narrative of ‘LLM hallucination’ and is forcing companies to take an end-to-end systems approach. I want to dive deep into one such AI evaluation system today.
Recently, I met two customer personas from different industries: CPG (Head of Procurement and Supply Chain) and a healthcare data startup (CTO). Both of them had a similar ask: how can we provide a natural language interface for users to ask questions of structured and unstructured data? As more companies look to deploy GenAI use-cases to meaningfully improve employee productivity with decision-augmentation tools, there is increasing interest in combining the conversational and reasoning capabilities of LLMs with the scalable computational power of data management systems.
Scenario: A buyer is going into a contract renewal negotiation with a supplier. One of the key questions they would have is: ‘What are the existing contract terms that turned out unfavorably for the company?’ Conceptually, answering this would need to go through the following steps:
1/ Extract the terms and conditions from the contract (typically, a document) and identify the ones that would have material impact (e.g. Accounts Payable terms, volume discount tiers etc.). LLMs do a pretty good job at semantic search and at interacting with buyers to quickly narrow down the relevant conditions for investigation, especially when the models are grounded with the right prompts and RAG.
2/ For each of these conditions, the next step is to translate them into SQL queries that extract the relevant data from the procurement data lakes. This is non-trivial: on the widely respected BIRD benchmark leaderboard, the top performer is at 75% accuracy (vs. 93% accuracy from data engineers). Clearly, AI systems are not yet good enough, even more so when the financial implications could be significant. A minimal sketch of this generation step follows below.
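To make the shape of that step concrete, here is a minimal, hedged sketch of the generation side: the schema, the question, and a few verified examples are packed into a prompt and sent to whatever model endpoint the enterprise has approved. The `call_llm` parameter and the `load_schema`/`format_examples` helpers in the usage comment are placeholders I am assuming for illustration, not part of any specific product or API.

```python
# Minimal text-to-SQL sketch: schema, question, and a few ground-truth
# examples are packed into a prompt; `call_llm` is a placeholder for whatever
# model endpoint the enterprise has approved (not a specific vendor API).

from typing import Callable

PROMPT_TEMPLATE = """You are a SQL generator for a procurement data lake.
Schema:
{schema}

Examples of past questions and their verified SQL:
{examples}

Question: {question}
Return only a single SQL query."""

def generate_sql(question: str,
                 schema: str,
                 examples: str,
                 call_llm: Callable[[str], str]) -> str:
    """Build a grounded prompt and ask the model for a candidate query."""
    prompt = PROMPT_TEMPLATE.format(schema=schema, examples=examples,
                                    question=question)
    return call_llm(prompt).strip()

# Hypothetical usage (helper names are assumptions for illustration):
# sql = generate_sql(
#     "Which suppliers missed the volume discount tier in 2023?",
#     schema=load_schema("procurement"),        # e.g. a helper that dumps DDL
#     examples=format_examples(ground_truth[:3]),
#     call_llm=my_llm_client.complete,          # any completion endpoint
# )
```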
There is interesting work being done in this area – most recently, this paper from UC Berkeley/Stanford that proposes a ‘Table Augmented Generation’ (TAG) framework for unifying LLMs and databases for natural language queries. While these methods continue to evolve, we will also need to deploy evaluation frameworks that are specific to enterprise deployments. Here’s one such blueprint:
Evaluation
- Evaluation against cross-domain datasets (e.g. BIRD) is an important starting point, but note that these cover databases across multiple domains (e.g. blockchain, healthcare etc.), while enterprise requirements are domain-specific.
- Evaluating against domain data using historical question-SQL pairs (‘ground truth’). These questions are run through the text-to-SQL application and the generated queries are compared against the ground-truth queries on two dimensions:
- Syntactic matching: Compare the structure and syntax of the generated SQL statements with the ground truth queries
- Execution matching: Assess whether the generated SQL statements, when executed, produce the same results as the ground-truth queries. Execution matching can be done on multiple metrics (a sketch follows below), for example:
- Row count: number of rows returned
- Content: actual values returned (e.g. checksum, random match)
The ‘ground-truth’ question-SQL pairs need to be generated by Subject Matter Experts and compiled over time by capturing questions and the corresponding SQL queries. (See Figure)
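To make the two matching dimensions concrete, here is a minimal sketch using only the Python standard library. It assumes a database connection (`conn`, shown here as SQLite) to a test copy of the warehouse; the normalization in `syntactic_match` is deliberately crude (a production version would compare parsed ASTs), and `execution_match` implements the row-count and content (checksum) metrics listed above.

```python
# A sketch of the two matching checks described above, standard library only.
# `conn` is a connection to a test copy of the warehouse; in production this
# would run against the procurement data lake instead of SQLite.

import hashlib
import re
import sqlite3

def syntactic_match(generated_sql: str, ground_truth_sql: str) -> bool:
    """Crude structural comparison: lowercase and collapse whitespace.
    A real implementation would compare parsed SQL ASTs."""
    def norm(q: str) -> str:
        return re.sub(r"\s+", " ", q.strip().lower().rstrip(";"))
    return norm(generated_sql) == norm(ground_truth_sql)

def execution_match(generated_sql: str, ground_truth_sql: str,
                    conn: sqlite3.Connection) -> dict:
    """Run both queries; compare row count and an order-insensitive checksum."""
    def run(query: str):
        rows = conn.execute(query).fetchall()
        digest = hashlib.sha256(
            "\n".join(sorted(repr(r) for r in rows)).encode()
        ).hexdigest()
        return len(rows), digest

    gen_count, gen_digest = run(generated_sql)
    gt_count, gt_digest = run(ground_truth_sql)
    return {
        "row_count_match": gen_count == gt_count,
        "content_match": gen_digest == gt_digest,
    }
```

In practice, the thresholds and metrics (e.g. exact checksum vs. sampled row comparison) would be tuned per domain, just as the deployment notes below suggest.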

Deployment
Deploying this capability into production could take a phased approach:
- Wave-1: Onboard ‘power-users’ who are capable of evaluating the quality of the generated SQL. This could be done in a 3-step process:
- Evaluate the tables/views identified by the LLM against the relevant ones for the question
- Evaluate the queries built by the LLM for syntactic correctness and business logic
- Evaluate the queries built by the LLM for execution correctness
- Wave-2: Onboard regular users with the ability to provide feedback on the responses to their questions, and record the question-response pairs. The ones that are not accepted by end-users can be tagged and routed to a human ‘power-user’ to generate the correct query, which can then be added to the ground-truth set. Correctness should be tracked as an ongoing metric (a sketch of this feedback loop follows).
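Below is a rough sketch of what that Wave-2 feedback loop could look like: each question-response pair is recorded, rejected ones are routed to a power-user review queue, and the acceptance rate is tracked as the ongoing correctness metric. The class and field names are illustrative assumptions, not a reference implementation.

```python
# Sketch of the Wave-2 feedback loop: record each question/response pair,
# route rejected ones to a power-user queue, and track a running acceptance
# rate as the ongoing correctness metric. Names and storage are illustrative.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackRecord:
    question: str
    generated_sql: str
    accepted: bool
    user_id: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class FeedbackLog:
    def __init__(self) -> None:
        self.records: list[FeedbackRecord] = []
        self.review_queue: list[FeedbackRecord] = []  # routed to power-users

    def record(self, rec: FeedbackRecord) -> None:
        self.records.append(rec)
        if not rec.accepted:
            # A power-user writes the correct query for this question; the
            # verified pair can then join the ground-truth set over time.
            self.review_queue.append(rec)

    def acceptance_rate(self) -> float:
        """Correctness proxy tracked over time; alert if it drifts below a threshold."""
        if not self.records:
            return 0.0
        return sum(r.accepted for r in self.records) / len(self.records)
```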
From an implementation point of view, this needs to be part of the overall AI evaluation and governance process, downstream of the enterprise-level guardrails (e.g. privacy) and governance (e.g. permissions) implementation. Just as we define thresholds for ML-based transaction fraud detection, we will have to iteratively refine the syntactic and execution matching criteria. We have recommended this approach to both of the customers I mentioned above, and I am eager to see how it evolves. As they say, these are early days, and as this area continues to move at breakneck speed, I am looking forward to a much deeper conversation around AI trust at next year’s Thanksgiving!