Posted on Sep 24th, 2024 by Felipe Aníbal
Last updated on Oct 1st, 2024
For the past year, I have dedicated myself to studying machine learning and have written about some of it here. In the meantime, an opportunity arose to apply what I've learned in a hands-on research project. For the next 12 months, I will be a research assistant on a project that studies how to generate database queries from sentences in Portuguese, an application of a topic often called text-to-SQL translation. More specifically, I'll be working on data augmentation techniques for models that perform text-to-SQL translation.
Imagine researchers have collected large volumes of data about public schools in Brazil. The dataset includes invaluable information such as the schools' locations, the size of their facilities, the number of enrolled students, the resources available at each school, and so on. This data can be extremely useful for policy making, helping answer questions like, "How many schools have access to computers?", "Which cities have the fewest schools?", or "What is the average number of students in schools in Feira de Santana?". With some programming, these questions could be answered easily. Programmers can use what is called Structured Query Language to retrieve this information from the database. But what if a system could be built that answered these questions directly, without the need for programming? This is where text-to-SQL comes in.
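To make this concrete, here is what "some programming" looks like in practice. This is a minimal sketch using Python's built-in sqlite3 module, with a hypothetical `schools` table invented for illustration; the real CulturaEduca datasets have their own schemas.

```python
import sqlite3

# A toy in-memory database with a hypothetical "schools" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE schools (
        name TEXT,
        city TEXT,
        enrolled_students INTEGER,
        has_computers INTEGER  -- 1 = yes, 0 = no
    )"""
)
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?, ?)",
    [
        ("Escola A", "Feira de Santana", 350, 1),
        ("Escola B", "Feira de Santana", 500, 0),
        ("Escola C", "Salvador", 420, 1),
    ],
)

# "How many schools have access to computers?"
n_with_computers = conn.execute(
    "SELECT COUNT(*) FROM schools WHERE has_computers = 1"
).fetchone()[0]

# "What is the average number of students in schools in Feira de Santana?"
avg_students = conn.execute(
    "SELECT AVG(enrolled_students) FROM schools WHERE city = 'Feira de Santana'"
).fetchone()[0]

print(n_with_computers)  # 2
print(avg_students)      # 425.0
```

The goal of a text-to-SQL system is to spare the user from writing those `SELECT` statements: the user types the question, and the system produces the query.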
Text-to-SQL refers to a set of techniques aimed at translating questions in natural language (NL)—a language spoken by humans—into Structured Query Language (SQL), which is used by programmers to retrieve data from a database.
Text-to-SQL translation has evolved significantly in the past five years. The advent of large language models (LLMs) like GPT has had a major impact on the field. However, many challenges remain! Some relate to understanding the NL input, others to producing the correct SQL output. Here are some challenges highlighted in the 2023 survey by George Katsogiannis-Meimarakis and Georgia Koutrika1 that help show why this is no trivial task.
Natural Language Challenges:
SQL Challenges:
Many models have been developed to address the problems mentioned above. Some of the best-performing approaches today involve machine learning and the latest LLMs, like GPT. These approaches essentially "train" computers to learn from examples: researchers feed the models a series of NL questions paired with their corresponding SQL queries, and the models "learn" from this data to predict the best "translation" of a new NL question into a SQL query.
These machine learning models require large volumes of data! The more examples they have to learn from, the more robust they become. Thankfully, alongside the significant technical advances in machine learning algorithms, the amount of available data has increased dramatically in recent years. Datasets called benchmarks have been created and published with the specific purpose of training and evaluating text-to-SQL models. Some of the most well-known benchmarks are BIRD4 and Spider5, and they have fueled great progress in the field.
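A benchmark entry is, at its core, a question paired with its query. The sketch below shows the general shape of such examples; the field names loosely follow benchmarks like Spider, but the exact schema varies between datasets, and the records here are invented for illustration.

```python
# A simplified sketch of text-to-SQL training examples: each record pairs
# a natural-language question with the SQL query that answers it, plus an
# identifier for the database the query runs against.
examples = [
    {
        "db_id": "schools",
        "question": "How many schools have access to computers?",
        "query": "SELECT COUNT(*) FROM schools WHERE has_computers = 1",
    },
    {
        "db_id": "schools",
        "question": "What is the average number of students in schools in Feira de Santana?",
        "query": "SELECT AVG(enrolled_students) FROM schools WHERE city = 'Feira de Santana'",
    },
]

for ex in examples:
    print(ex["question"], "->", ex["query"])
```

A model trained on thousands of such pairs learns to map new, unseen questions to queries over the same kind of schema.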
However, building accurate and diverse datasets is costly, and existing datasets still have limitations. Researchers Karina Fróes2 and Diego Lopes3 highlight that more specific SQL queries, like temporal and geospatial queries, are still underrepresented in datasets for text-to-SQL applications. Additionally, when working with languages other than English, we have very limited data. The lack of training data in these cases hinders the development of robust and accurate models. In theory, more data would allow us to create better models both for these specific query types and for languages other than English.
To address these limitations, data augmentation has been explored, and it will be the focus of my research in the coming months. The goal of data augmentation is to generate synthetic data that can be used alongside real data to improve a model's performance. Data augmentation is widely used in other ML fields like image processing and computer vision, and it has shown promising results in the context of text-to-SQL as well6.
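As a minimal sketch of one common augmentation idea (not necessarily the techniques this research will use): start from a seed question/query pair with a placeholder slot and fill the slot with different values drawn from the database, producing new synthetic pairs. The templates and city list below are hypothetical.

```python
# Seed pair with a "{city}" slot in both the NL question and the SQL query.
seed_question = "What is the average number of students in schools in {city}?"
seed_query = "SELECT AVG(enrolled_students) FROM schools WHERE city = '{city}'"

# Values that would, in practice, come from the database itself.
cities = ["Feira de Santana", "Salvador", "Recife"]

# Fill the slot with each value to generate new synthetic training pairs.
synthetic_pairs = [
    (seed_question.format(city=c), seed_query.format(city=c)) for c in cities
]

for question, query in synthetic_pairs:
    print(question, "->", query)
```

One seed pair thus yields as many synthetic pairs as there are values to substitute; richer techniques paraphrase the question or vary the query structure as well.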
One of the goals of this research is to create a system like the one described at the beginning of this article. We will apply the techniques we learn from this research to datasets from the CulturaEduca platform, which contains data on culture and education in Brazil, to build an interface that allows users to ask questions about a dataset and have them answered without any programming knowledge.
In the coming weeks, I will explore the existing literature on data augmentation for text-to-SQL translation. One of the goals of this research project is to explore different data augmentation techniques, understand their impact, and use them to produce better models for the Portuguese language.
My research will assist Karina Fróes2 and Diego Lopes3 in their work on temporal and geospatial text-to-SQL translation, and I will be mentored by Professor Kelly Braghetto.
Inspired by Otávio Silva, I hope to keep this blog updated with the latest advancements in my research in the form of regular "research updates".