LLMs are being tried out for so many things today, and one use case that sounds compelling is getting their help to generate insights from data. What if you could find the answer to your data question without begging a data analyst in your company? But this is easier said than done. To do this properly, an LLM needs to know about your datasets: the tables, their schemas, and the values stored in them. You can put this information in the prompt itself if your dataset is tiny, but in most real-life scenarios that isn't possible, since the information is huge and either won't fit in the LLM's context window or will be too expensive to be feasible.

A workaround for this problem is to use RAG to provide the LLM with a few relevant SQL queries it can use to understand the tables and data structure before responding to the user's question. But again, this is easier said than done, since typical semantic and keyword-based retrieval techniques don't work well enough for code and SQL queries.

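To make that limitation concrete, here is a rough sketch of the baseline approach: embed the raw SQL of each sample query, embed the user's question, and rank by similarity. The embedding model, query text, and variable names are illustrative assumptions, not from the Timescale post.

```python
# Baseline retrieval over raw SQL text (illustrative sketch, not the Timescale implementation).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

sample_queries = [
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;",
    "SELECT date_trunc('month', created_at) AS month, COUNT(*) FROM signups GROUP BY 1;",
]

question = "Which customers spent the most last quarter?"

# Embed the raw SQL and the question, then rank by cosine similarity.
# Because SQL shares little surface vocabulary with natural-language questions,
# the scores tend to be close together and make for a weak ranking signal.
query_embeddings = model.encode(sample_queries, convert_to_tensor=True)
question_embedding = model.encode(question, convert_to_tensor=True)
scores = util.cos_sim(question_embedding, query_embeddings)[0]

best_match = sample_queries[int(scores.argmax())]
print(best_match)
```
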
The team at Timescale explains a clever trick for building text-to-SQL with LLMs that aims to solve this limitation by creating synthetic summaries for sample SQL queries. The generated summaries can then be used in the retrieval step, matching the user's question with the most relevant sample SQL queries.

Here’s the basic idea:

  1. We take existing SQL queries and create detailed summaries of what they do.
  2. We use these summaries, rather than the raw SQL, when trying to match user questions to relevant queries.
  3. We can generate more of these summaries as needed, allowing us to improve our system continually.

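A minimal sketch of those steps, assuming an OpenAI-style client and a small set of stored sample queries. The prompts, model names, and helper names here are my assumptions for illustration, not the Timescale implementation.

```python
# Sketch: generate synthetic summaries for sample SQL queries, then index the
# summaries (not the raw SQL) for retrieval.
from openai import OpenAI

client = OpenAI()

def summarize_sql(sql: str) -> str:
    """Ask the LLM for a natural-language description of what the query does."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Describe in plain English what this SQL query computes, including the tables and business concepts involved."},
            {"role": "user", "content": sql},
        ],
    )
    return response.choices[0].message.content

sample_queries = [
    "SELECT customer_id, SUM(amount) AS revenue FROM orders WHERE created_at >= now() - interval '90 days' GROUP BY customer_id;",
]

# Step 1: create detailed summaries of the existing SQL queries.
summaries = [summarize_sql(sql) for sql in sample_queries]

# Step 2: embed the summaries; at question time we match against these
# instead of the raw SQL text.
embedding_data = client.embeddings.create(
    model="text-embedding-3-small",
    input=summaries,
).data
index = [
    {"sql": sql, "summary": summary, "embedding": item.embedding}
    for sql, summary, item in zip(sample_queries, summaries, embedding_data)
]

# Step 3 is simply rerunning this over new or updated queries to grow the index.
```
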
Synthetic summaries: We use AI to create detailed summaries of tables and SQL code. These summaries help us understand the information better than just looking at keywords.

Retrieval augmented in-context learning: Our system employs a two-step process to enhance SQL generation:

  • We first identify relevant tables and snippets.
  • The retrieved tables and code examples are used to provide context-rich instructions to our large language model (LLM), enabling it to better understand and apply specific business rules when generating SQL queries.

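A rough sketch of that two-step flow, reusing the illustrative index built in the previous snippet: retrieve the most relevant summaries and their SQL, then feed them to the LLM as context for generating a new query. Again, the function names, prompts, and model names are assumptions for illustration only.

```python
# Sketch of retrieval-augmented in-context learning for SQL generation.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    data = client.embeddings.create(model="text-embedding-3-small", input=[text]).data
    return np.array(data[0].embedding)

def top_k(question: str, index: list[dict], k: int = 3) -> list[dict]:
    """Step 1: rank stored items by cosine similarity between the question and their summaries."""
    q = embed(question)
    def score(item: dict) -> float:
        e = np.array(item["embedding"])
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
    return sorted(index, key=score, reverse=True)[:k]

def generate_sql(question: str, index: list[dict]) -> str:
    """Step 2: build a context-rich prompt from the retrieved summaries and example queries."""
    context = "\n\n".join(f"-- {item['summary']}\n{item['sql']}" for item in top_k(question, index))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "You write PostgreSQL. Use only the tables and patterns shown in the provided examples."},
            {"role": "user", "content": f"Relevant tables and example queries:\n{context}\n\nQuestion: {question}\nSQL:"},
        ],
    )
    return response.choices[0].message.content

# Example usage, with `index` built as in the previous sketch:
# print(generate_sql("Which customers spent the most last quarter?", index))
```
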
… this approach has shown some promising results:

  • It improved accuracy in finding the right information from 81% to 90%.
  • It maintained good performance even when we added irrelevant data.
  • It seemed better at capturing business-specific logic than traditional methods.

source
