Research Project: Data Augmentation For Text-to-SQL

Posted on Sep 24th, 2024 by Felipe Aníbal

Last updated on Oct 1st, 2024

For the past year, I have dedicated myself to studying machine learning and decided to write about some of it here. In the meantime, an opportunity arose to work on a research project and apply what I've learned in a hands-on setting. For the next 12 months, I will be a research assistant on a project that studies how to generate database queries from sentences in Portuguese, an application of a topic often called text-to-SQL translation. More specifically, I'll be working on data augmentation techniques for models that perform text-to-SQL translation.

What is text-to-SQL translation?

Imagine researchers have collected large volumes of data about public schools in Brazil. The dataset includes invaluable information such as the schools' locations, the size of their facilities, the number of enrolled students, the resources available at each school, and so on. This data can be extremely useful for policy making, helping answer questions like, "How many schools have access to computers?", "Which cities have the fewest schools?", or "What is the average number of students in schools in Feira de Santana?". With some programming, these questions could be answered easily. Programmers can use what is called Structured Query Language to retrieve this information from the database. But what if a system could be built that answered these questions directly, without the need for programming? This is where text-to-SQL comes in.

Text-to-SQL refers to a set of techniques aimed at translating questions in natural language (NL)—a language spoken by humans—into Structured Query Language (SQL), which is used by programmers to retrieve data from a database.
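
To make the definition concrete, here is a minimal sketch in Python of a single NL-to-SQL translation, over a hypothetical schools table whose schema I am inventing for illustration. The pairing of the question with the query is exactly what a text-to-SQL system must produce; sqlite3 is used here only to show that the query actually runs.

```python
import sqlite3

# Hypothetical toy schema, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schools (name TEXT, city TEXT, students INTEGER, has_computers INTEGER)"
)
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?, ?)",
    [
        ("Escola A", "Feira de Santana", 320, 1),
        ("Escola B", "Feira de Santana", 150, 0),
        ("Escola C", "Salvador", 500, 1),
    ],
)

# The NL question, and the SQL query a text-to-SQL system should map it to.
question = "How many schools have access to computers?"
query = "SELECT COUNT(*) FROM schools WHERE has_computers = 1"

print(question)
print(conn.execute(query).fetchone()[0])  # -> 2
```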

What are some of the challenges of text-to-SQL?

Text-to-SQL translation has evolved significantly in the past five years. The advent of large language models (LLMs) like GPT has had a major impact on the field. However, many challenges remain! Challenges arise both in understanding the NL input and in producing the correct SQL output. Here are some challenges highlighted in the 2023 survey by George Katsogiannis-Meimarakis and Georgia Koutrika [1] that help show why this is no trivial task.

Natural Language Challenges:

  • Lexical Ambiguity: Words can have multiple meanings depending on the context. For example, "Paris" could refer to the city or to a person's name. The system must resolve these ambiguities to generate accurate SQL queries.
  • Syntactic Ambiguity: Sentences may have multiple interpretations based on their structure. For instance, the question "Find all German movie directors" could refer to directors who are German or to directors of German-language movies, which leads to different SQL outputs (see the sketch after this list).
  • Semantic Ambiguity: Even a structurally unambiguous sentence can carry different meanings. For example, the query "Are Brad and Angelina married?" could be asking whether they are married to each other or each married to someone else.
  • Context-dependent Ambiguity: Words like "top" or "best" are context-sensitive. In a query like "Who was the best runner of the marathon?", "best" likely refers to the fastest runner, but in another context it could mean the one with the most medals.
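
To make the syntactic-ambiguity point concrete, here is a sketch of the two SQL queries that "Find all German movie directors" could map to. The schema is hypothetical, with directors(name, nationality) and movies(title, language, director_name) invented for illustration:

```python
# Reading 1: directors who are German.
sql_german_directors = """
SELECT name
FROM directors
WHERE nationality = 'Germany'
"""

# Reading 2: directors of German-language movies.
sql_directors_of_german_movies = """
SELECT DISTINCT d.name
FROM directors AS d
JOIN movies AS m ON m.director_name = d.name
WHERE m.language = 'German'
"""
```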

SQL Challenges:

  • SQL Syntax Complexity: SQL has a strict syntax and lacks the expressiveness of natural language. This often results in complex SQL queries for seemingly simple questions, like those involving nested queries or multiple joins.
  • Database Structure Mismatch: The way users think about data often differs from the database schema. For example, a user might ask for "actresses," but the database might use the term "actor" for both genders, which could cause issues when generating the correct SQL query.
  • Handling Implicit Joins: Many user queries involve implicit relationships between tables. For example, asking "Who directed the movie 'A Beautiful Mind'?" might require joining several tables in the database, which the system must handle correctly (see the sketch after this list).
  • Entity Modeling: How entities are modeled in a database can vary greatly. For instance, querying students in a university database differs from querying movie genres in a film database, as some entities are stored across multiple tables, while others are modeled in many-to-many relationships.
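
As an illustration of the implicit-join challenge, the question "Who directed the movie 'A Beautiful Mind'?" never mentions tables or keys, yet over a hypothetical schema with movies, directors, and a linking table directed_by (all invented here for illustration), a correct translation has to join all three:

```python
# The user asks a single short question, but the model must infer two joins.
sql_who_directed = """
SELECT d.name
FROM directors AS d
JOIN directed_by AS db ON db.director_id = d.id
JOIN movies AS m ON m.id = db.movie_id
WHERE m.title = 'A Beautiful Mind'
"""
```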

The promise of data

Many models have been developed to address the problems mentioned above. Some of the best-performing systems today involve machine learning and the latest large language models (LLMs), like GPT. These strategies essentially "train" computers to learn from examples. Researchers feed these models a series of NL questions paired with their corresponding SQL queries. The models then "learn" from this data and become able to predict the best "translation" of an NL question into a SQL query.
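
Concretely, the training data consists of question-query pairs. Here is a sketch of what such examples might look like; the format is loosely inspired by benchmarks like Spider, but the field names are my own:

```python
# Each example pairs an NL question with its gold SQL query
# for a given database.
training_examples = [
    {
        "db_id": "schools",
        "question": "How many schools have access to computers?",
        "query": "SELECT COUNT(*) FROM schools WHERE has_computers = 1",
    },
    {
        "db_id": "schools",
        "question": "Which cities have the fewest schools?",
        "query": (
            "SELECT city FROM schools GROUP BY city "
            "ORDER BY COUNT(*) ASC LIMIT 1"
        ),
    },
]
```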

These machine learning models require large volumes of data! The more examples they have to learn from, the more robust they get. Thankfully, alongside the significant technical advances in machine learning algorithms, the amount of available data has increased dramatically in recent years. Datasets called benchmarks have been created and published with the specific purpose of training and evaluating text-to-SQL models. Some of the most well-known benchmarks are BIRD [4] and Spider [5], and they have fueled great progress in the field.

However, building accurate and diverse datasets is costly, and the datasets we have still have limitations. Researchers Karina Fróes [2] and Diego Lopes [3] highlight that more specialized SQL queries, such as temporal and geospatial queries, are still underrepresented in datasets for text-to-SQL applications. Additionally, data is very limited for languages other than English. The lack of training data in these cases hinders the development of robust and accurate models. In theory, with more data we could build better models both for these specific query types and for other languages.

What is data augmentation?

To address these limitations, data augmentation has been explored as a technique, and it will be the focus of my research in the coming months. The goal of data augmentation is to generate synthetic data that can be used alongside real data to improve a model's performance. Data augmentation is widely used in other ML fields, like image processing and computer vision, and it has shown promising results in the context of text-to-SQL as well [6].
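
As a first intuition for what augmentation can look like in this setting, here is a minimal sketch of one simple family of techniques: instantiating an NL-SQL template with different values drawn from the database, so that one hand-written pair yields many synthetic ones. The template and helper below are my own illustration, not a method from the works cited here:

```python
# One hand-written template can generate many synthetic NL-SQL pairs
# by filling its slot with values taken from the database.
TEMPLATE = {
    "question": "What is the average number of students in schools in {city}?",
    "query": "SELECT AVG(students) FROM schools WHERE city = '{city}'",
}

# In practice these could come from: SELECT DISTINCT city FROM schools
cities = ["Feira de Santana", "Salvador", "Recife"]

def augment(template, values):
    """Instantiate the template once per value, yielding synthetic pairs."""
    for v in values:
        yield {
            "question": template["question"].format(city=v),
            "query": template["query"].format(city=v),
        }

synthetic_pairs = list(augment(TEMPLATE, cities))
print(len(synthetic_pairs))  # -> 3 synthetic examples from one template
```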

Research applications

One of the goals of this research is to create a system just like the one described at the beginning of this article. We will apply the techniques we learn from this research to datasets from the CulturaEduca platform, which contains data on culture and education in Brazil, to build an interface that allows users to ask questions about a dataset and have them answered without any programming knowledge.

Next steps

In the coming weeks, I will explore what the literature already offers on data augmentation for text-to-SQL translation. One of the goals of this research project is to explore different data augmentation techniques, understand their impact, and use them to produce better models for the Portuguese language.

My research will assist Karina Fróes [2] and Diego Lopes [3] in their work on temporal and geospatial text-to-SQL translation, and I will be mentored by Professor Kelly Braghetto.

Inspired by Otávio Silva, I hope to keep this blog updated with the latest advancements in my research in the form of regular "research updates".

References:

  1. George Katsogiannis-Meimarakis and Georgia Koutrika. 2023. A survey on deep learning approaches for text-to-SQL. The VLDB Journal 32, 4 (Jul 2023), 905–936. https://doi.org/10.1007/s00778-022-00776-8
  2. Karina de Carvalho Fróes and Kelly Rosa Braghetto. 2023. Understanding the Limitations of Text-to-SQL Models: A Brief Analysis of Temporal Queries.
  3. Diego Oliveira Lopes and Kelly Rosa Braghetto. 2023. Expanding Text-to-SQL Capabilities: Towards Comprehensive Geospatial Datasets.
  4. Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. Advances in Neural Information Processing Systems. https://arxiv.org/pdf/2305.03111
  5. Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. https://arxiv.org/abs/1809.08887
  6. Ruoxi Sun, Sercan Ö. Arik, Alex Muzio, Lesly Miculicich, Satya Gundabathula, Pengcheng Yin, Hanjun Dai, Hootan Nakhost, Rajarishi Sinha, Zifeng Wang, Tomas Pfister. 2023. SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended). https://arxiv.org/abs/2306.00739