Table

Table itself is an abstraction of two-dimensional data that can be pulled from various sources. The table spec defines how to pull the data from the source and how to enrich it with metadata, such as column type, column name, column description, possible values, max/min values, whether a column is a primary key, and whether it is nullable. With this metadata, an LLM can better understand the data and achieve high accuracy when translating natural language to SQL.

Data Source

As of now, we support the following data sources:

  • CSV file
  • Excel file
  • Parquet file
  • JSON file
  • SQLite file
  • PostgreSQL
  • MySQL
  • Microsoft SQL Server
  • CSV/Parquet file on S3
  • CSV/Parquet file on Google Cloud Storage
  • CSV/Parquet file on Azure Blob Storage

More data sources will be supported in the future. Each data source has a unique name in the table definition.
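Conceptually, each source type maps to a loader that produces the same row-based shape. Here is a minimal sketch of that dispatch, using only local file formats; the function names, the registry, and the list-of-dicts row shape are illustrative assumptions, not ReByte's actual API:

```python
import csv
import json
import sqlite3

def load_csv(path):
    # Read a CSV file into a list of {column: value} dicts
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_json(path):
    # Assumes the JSON file holds a top-level array of objects
    with open(path) as f:
        return json.load(f)

def load_sqlite(path, table):
    # Pull all rows from one table in a SQLite file
    con = sqlite3.connect(path)
    con.row_factory = sqlite3.Row
    rows = [dict(r) for r in con.execute(f"SELECT * FROM {table}")]
    con.close()
    return rows

# Hypothetical registry keyed by the data source's unique name/type
LOADERS = {"csv": load_csv, "json": load_json, "sqlite": load_sqlite}

def load_table(source_type, *args):
    return LOADERS[source_type](*args)
```

Because every loader returns the same row shape, the rest of the pipeline (materialization, metadata extraction) does not need to know where the data came from.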

Data Select Query

Each table has a data select query that defines how to pull the data from the source. The query can be a simple select, a complex query that joins multiple data sources, or a query that calls a stored procedure. ReByte provides a web-based tool that lets you express the query in natural language and then converts it to SQL.

A simple select query might just select everything from a table:

select * from <data_source_name>

A complex query might join multiple tables and perform calculations, aggregations, etc.:

select a.*, b.* from <data_source_name1> a join <data_source_name2> b on a.id = b.id

Table Materialization

Internally, ReByte materializes the table data into a local file so that subsequent queries can run without pulling the data from the source again. This has two benefits:

  • it accelerates query speed
  • it reduces the cost of pulling data from the source, particularly when the source is a cloud data warehouse that charges for outgoing (egress) data

You can control the materialization behavior in the table definition, notably how often to refresh the data.
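The pattern can be sketched as a cache with a refresh interval: pull rows into a local SQLite file, and only hit the source again when the file is older than the configured interval. The function and parameter names here are assumptions for illustration, not ReByte's actual implementation:

```python
import os
import sqlite3
import time

def materialize(table_name, pull_rows, cache_path, refresh_seconds=3600):
    # Reuse the local file if it exists and is newer than refresh_seconds
    fresh = (
        os.path.exists(cache_path)
        and time.time() - os.path.getmtime(cache_path) < refresh_seconds
    )
    con = sqlite3.connect(cache_path)
    if not fresh:
        rows = pull_rows()  # hits the original (possibly remote) source
        cols = list(rows[0].keys())
        con.execute(f"DROP TABLE IF EXISTS {table_name}")
        con.execute(f"CREATE TABLE {table_name} ({', '.join(cols)})")
        con.executemany(
            f"INSERT INTO {table_name} VALUES ({', '.join('?' for _ in cols)})",
            [tuple(r[c] for c in cols) for r in rows],
        )
        con.commit()
    return con  # subsequent queries run against the local file
```

Within the refresh window, every query is served from the local file, which is what delivers both the speed-up and the egress-cost savings described above.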

Table Metadata

Each table comes with a set of metadata that enriches the original data. The metadata includes:

  • column type
  • column name
  • user provided description
  • possible values of each column
  • max/min values of each column
  • whether a column is a primary key
  • whether a column is nullable
  • value distribution of each column

This metadata is used by the LLM agent to better understand the data and achieve high accuracy when translating natural language to SQL.
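Much of the metadata listed above can be derived mechanically from a sample of the rows. A minimal sketch, where the output dictionary shape is an assumption rather than ReByte's actual schema:

```python
def column_metadata(rows):
    # Derive per-column metadata (type, nullability, min/max, possible
    # values) from a list of {column: value} row dicts.
    meta = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        meta[col] = {
            "type": type(non_null[0]).__name__ if non_null else "unknown",
            "nullable": len(non_null) < len(values),
            "min": min(non_null) if non_null else None,
            "max": max(non_null) if non_null else None,
            "possible_values": sorted(set(non_null), key=str),
        }
    return meta
```

User-provided descriptions and primary-key flags cannot be inferred this way; those come from the table definition itself.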

Table Composition

Tables can be arbitrarily composed to form a virtual database, and subsequent queries can then be run against that database. Each table's name becomes its table name in the virtual database. For example, suppose you have three tables: sales_number_2023, sales_number_2024, and sales_employee. Depending on the use case, you can compose them in the following ways:

  • database1: sales_number_2023 + sales_employee
  • database2: sales_number_2024 + sales_employee
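One way to picture such a composition is loading each member table into a single in-memory SQLite database, so that one SQL query can join across them. The table and column names below are made up to mirror the sales example; this is a sketch of the idea, not ReByte's implementation:

```python
import sqlite3

def compose(tables):
    """tables: {table_name: (columns, rows)} -> in-memory connection."""
    con = sqlite3.connect(":memory:")
    for name, (cols, rows) in tables.items():
        con.execute(f"CREATE TABLE {name} ({', '.join(cols)})")
        con.executemany(
            f"INSERT INTO {name} VALUES ({', '.join('?' for _ in cols)})",
            rows,
        )
    return con

# database1: sales_number_2023 + sales_employee
db1 = compose({
    "sales_number_2023": (["employee_id", "amount"], [(1, 100), (2, 250)]),
    "sales_employee": (["employee_id", "name"], [(1, "Ada"), (2, "Bo")]),
})
```

A query against db1 can then join sales figures with employee names, exactly as it would against a conventional database containing both tables.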