Fun with Arrays in SQL
Today I went down quite a rabbit hole. I was playing around with native array support in SQLite.
So what exactly does this mean, and why should anyone care?
Let's start this off by looking at SQL, which is a somewhat standardized definition and manipulation language for relational databases. Similar to other foundational technologies in the IT sphere, SQL is quite ancient, as it was designed in the 70s. It thus lacks a lot of modern features, like a standardized way of working with arrays, or lists, of data.
The last time I talked about SQL on this blog was more than 5 years ago, when I was experimenting with writing my own strictly typed query builder in TypeScript, inspired by diesel.
Back then, I was already lamenting the lack of array support, in particular in combination with the IN (?) operator.
What this means is that it is not enough to define a single placeholder for a list of values; you have to use N placeholders, one for each value.
This is bad because it pretty much forces you to do code generation at runtime, and you end up with N unique queries. Queries that the SQL server has to parse, plan, optimize and otherwise deal with.
Ideally, you would want to prepare a query once, and then reuse it often. That amortizes some per-query costs and helps performance. If you end up with N unique queries that are rarely reused, you are leaving performance on the table.
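To illustrate, here is a minimal sketch of what that runtime code generation looks like, using rusqlite and a test (id, name) table of the same shape as in the examples below; the query_by_ids helper is made up for this post:
fn query_by_ids(db: &rusqlite::Connection, ids: &[i64]) -> rusqlite::Result<Vec<String>> {
    // One "?" per value: "?, ?, ?, ...", so every distinct list length yields a distinct SQL string.
    let placeholders = vec!["?"; ids.len()].join(", ");
    let sql = format!("SELECT name FROM test WHERE id IN ({placeholders});");
    let mut stmt = db.prepare(&sql)?;
    stmt.query_map(rusqlite::params_from_iter(ids), |row| row.get(0))?
        .collect()
}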
Apart from queries using the IN operator, you would also want to use arrays to optimize bulk inserts, which is what prompted me to research this topic once again. Let us look at an example together.
Inserting into a table is done using the INSERT INTO statement. This is how it might look in Rust for a simple example.
All of the code is available in my sql-array repository if you are curious.
let trx = db.transaction()?; // (1)
let mut stmt = trx.prepare("INSERT INTO test (id, name) VALUES (?, ?);")?; // (2)
for row in rows {
    stmt.execute((row.id, &row.name))?; // (3)
}
stmt.finalize()?;
trx.commit()?; // (4)
Here, I start the bulk insertion by opening a transaction (1), and finish everything by calling commit (4).
The idea is that the bulk insertion either succeeds as a whole, or it fails, in which case nothing will be inserted.
You can see that I prepare the INSERT statement (2), and then execute it once for each row I want to insert (3).
This is very simple, and there are more ways to optimize it, such as reusing the prepared statement across multiple bulk inserts. However, it is quite far from being optimal. Even though the transaction groups all those inserts together, we are still executing one statement per row. So we can hardly call this bulk insertion.
What a lot of database abstractions do instead is insert multiple rows per statement, in which case the statement would look a little bit like this:
INSERT INTO test (id, name) VALUES (?, ?), (?, ?), (?, ?);
… and so on …
You might see the problem already. We have to define each column of each row as a separate placeholder. So we generate an increasing amount of SQL code, which the SQL server has to parse. And then we have to provide a value for each placeholder separately.
There are practical limits to how many placeholders one can define, and there is surely a sweet spot for how many rows you can bulk insert with a single statement.
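To make the batching concrete, here is a minimal sketch of what such a chunked insert could look like with rusqlite. This is an illustration rather than the exact code from my sql-array repository; the Row struct and the insert_chunk helper only exist for this example:
use rusqlite::{params_from_iter, types::Value, Result, Transaction};

struct Row {
    id: i64,
    name: String,
}

// One multi-VALUES statement per chunk: "(?, ?), (?, ?), ..." with a
// placeholder pair per row, bound as one flat list of parameters.
fn insert_chunk(trx: &Transaction<'_>, chunk: &[Row]) -> Result<()> {
    let placeholders = vec!["(?, ?)"; chunk.len()].join(", ");
    let sql = format!("INSERT INTO test (id, name) VALUES {placeholders};");
    // prepare_cached lets the statement be reused whenever a chunk of the
    // same size (and thus the same SQL text) comes around again.
    let mut stmt = trx.prepare_cached(&sql)?;
    let params = params_from_iter(chunk.iter().flat_map(|row| {
        [Value::from(row.id), Value::from(row.name.clone())]
    }));
    stmt.execute(params)?;
    Ok(())
}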
I implemented both of these approaches using sqlite and postgresql, and ran a bunch of benchmarks:
benches fastest
├─ postgres
│ ├─ insert_one
│ │ ├─ 128 3.727 Kitem/s
│ │ ├─ 512 3.564 Kitem/s
│ │ ╰─ 1024 3.436 Kitem/s
│ ├─ insert_batched
│ │ ├─ 128 42.25 Kitem/s
│ │ ├─ 512 78.37 Kitem/s
│ │ ╰─ 1024 104.6 Kitem/s
╰─ sqlite
├─ insert_one
│ ├─ 128 3.339 Mitem/s
│ ├─ 512 3.41 Mitem/s
│ ╰─ 1024 3.437 Mitem/s
├─ insert_batched
│ ├─ 128 2.573 Mitem/s
│ ├─ 512 3.997 Mitem/s
│ ╰─ 1024 4.174 Mitem/s
I was using a batch size of 50 within this code, which means I am inserting in chunks of 50 rows at a time using a statement that is being reused, plus a second statement for the remainder of the elements.
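With the hypothetical insert_chunk helper from the earlier sketch, that chunking boils down to something like this (the last, smaller chunk automatically gets its own, differently sized statement):
for chunk in rows.chunks(50) {
    insert_chunk(&trx, chunk)?;
}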
As you can see, inserting rows one-by-one is pretty much flat in both postgres and sqlite, no matter the total number of rows I want to insert. Batch optimization seems to give huge wins in postgres, and some moderate wins in sqlite, but only with an increasing amount of total rows.
SQLite is also 3 orders of magnitude faster than postgres in this benchmark, primarily because it runs completely in process and in memory, whereas postgres was going through the local network to a docker container.
# Postgres UNNEST
About a year ago, I learned that postgresql has native support for arrays, as well as the UNNEST function. I stumbled across a cheat sheet that describes a bunch of bulk operations that are enabled by UNNEST, including the INSERT statement, as well as the IN operator that I started out with.
Using UNNEST for inserts looks like this:
INSERT INTO test (id, name) SELECT * FROM UNNEST($1::INT[], $2::TEXT[]);
Instead of having to dynamically generate SQL code at runtime, this is a completely static statement with only 2 placeholders. But I can use it to insert however many rows I want.
When you provide more than one placeholder / value to UNNEST, it acts as a transform similar to the struct of arrays pattern, converting a bunch of homogeneous arrays into multiple rows, using each of the arrays as the values for one column. It pretty much turns the struct of arrays into an array of structs.
That is quite amazing. My mind was blown when I learned about it, as I had been searching for functionality like that pretty much ever since I started seriously using SQL.
Let's see how this struct of arrays approach works, and whether it provides any speedups:
let query = "INSERT INTO test (id, name) SELECT * FROM UNNEST($1::INT[], $2::TEXT[]);"; // (1)
let ids: Vec<_> = rows.iter().map(|row| row.id).collect(); // (2)
let names: Vec<_> = rows.iter().map(|row| row.name.as_str()).collect();
trx.execute(query, &[&ids, &names])?; // (3)
As you can see, the query is completely static now (1), and I only execute it a single time (3). However, I have to do the array of structs to struct of arrays transformation manually before invoking the query (2).
But running the benchmarks again, we can see that it indeed provides yet another quite impressive speedup in postgres:
benches fastest
├─ postgres
│ ├─ insert_batched
│ │ ├─ 128 42.25 Kitem/s
│ │ ├─ 512 78.37 Kitem/s
│ │ ╰─ 1024 104.6 Kitem/s
│ ├─ insert_array
│ │ ├─ 128 63.47 Kitem/s
│ │ ├─ 512 206.5 Kitem/s
│ │ ╰─ 1024 321.3 Kitem/s
But, as I hinted at earlier, SQL is only somewhat standardized. The query syntax is roughly 90% the same among different SQL servers / implementations, but they differ wildly in other regards, like the data types they support, or the presence of UNNEST, which I have found to be pretty much exclusive to postgresql.
# SQLite carray
After a ton of searching around on the web, I finally found something similar to UNNEST, which in sqlite comes in the form of the carray extension.
I also found an in-depth article talking about bulk insert performance of sqlite, which also concluded that carray provides the most speedup for bulk inserts.
The Rust rusqlite crate comes with its own re-implementation called rarray.
One fundamental problem with carray / rarray, however, is that it only outputs a single column, and there is no readily available functionality that combines more than one of these columns using the struct of arrays pattern.
The article I linked above uses a CROSS JOIN, which creates a cross product (N*N) of the two arrays, combined with a join condition that limits the output to the matching rows. It looks like this:
INSERT INTO test (id, name) SELECT * FROM rarray(?) AS a CROSS JOIN rarray(?) AS b ON a.rowid = b.rowid;
That is quite a mouthful, and it does not really scale that well to more columns either.
Another downside specific to rusqlite is that rarray is internally an Rc<Vec<Value>>, which means that it does not support borrows, and requires all strings to be cloned like so:
let ids: Vec<_> = rows.iter().map(|row| Value::from(row.id)).collect();
let ids = Array::from(ids);
let names: Vec<_> = rows
    .iter()
    .map(|row| Value::from(row.name.clone()))
    .collect();
let names = Array::from(names);
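For completeness, binding those two arrays to the CROSS JOIN statement could then look roughly like this; this is a sketch that assumes the rarray module has been registered via rusqlite::vtab::array::load_module and that trx is the open transaction, so the real code may differ slightly:
let mut stmt = trx.prepare(
    "INSERT INTO test (id, name) SELECT * FROM rarray(?) AS a CROSS JOIN rarray(?) AS b ON a.rowid = b.rowid;",
)?;
// ids and names are the two Array (Rc<Vec<Value>>) values built above.
stmt.execute((ids, names))?;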
While researching this a little bit, I also found that rusqlite has an inefficiency: it also eagerly clones other strings internally, which is already being tracked as an issue.
So how does the performance look here?
╰─ sqlite
├─ insert_batched
│ ├─ 128 2.573 Mitem/s
│ ├─ 512 3.997 Mitem/s
│ ╰─ 1024 4.174 Mitem/s
├─ insert_array
│ ├─ 128 289.2 Kitem/s
│ ├─ 512 81.15 Kitem/s
│ ╰─ 1024 38.94 Kitem/s
Uhhh, what? This is significantly slower than batch-inserting, and it gets progressively slower the more rows you have.
That is not good. The article I linked above even goes into more detail about why that is. The reason is pretty much that the virtual tables backing the rarray do not support random access by rowid, if I read things right.
That's quite disappointing.
Doing all these experiments, I was digging a bit into the actual code of rarray within rusqlite, and I believe that the virtual table abstractions within sqlite should be powerful enough to fully support an array of structs. So you wouldn't even need the complex CROSS JOIN, and it should also be possible to avoid having to clone things.
I haven't fully understood the exact reasons why rusqlite uses an Rc and requires owned data, but I believe it should be possible to work with borrowed data as well. But that is a topic for another day.
# What about IN?
Last but not least, I also implemented using UNNEST and rarray, respectively, within an IN operator.
This is what it looks like in postgres:
let query = "SELECT id, name from test WHERE id IN (SELECT * FROM UNNEST($1::INT[]));";
let mut rows = vec![];
let mut iter = db.query_raw(query, &[ids])?;
while let Some(raw) = iter.next()? {
    rows.push(row_from_raw(&raw)?);
}
rows.sort();
The complexity in this code comes mostly from converting (and sorting) the results. The query itself is trivially simple.
The rusqlite code looks very similar, except the rarray abstraction requires converting the inputs to Value ahead of time.
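For reference, the sqlite side could be sketched like this, again assuming rarray has been registered via rusqlite::vtab::array::load_module; the ids: Vec<i64> input and the exact shape of the result handling are assumptions for this example:
use std::rc::Rc;
use rusqlite::{types::Value, vtab::array::Array};

// rarray requires owned Values, so convert the ids up front.
let values: Array = Rc::new(ids.iter().map(|&id| Value::from(id)).collect::<Vec<_>>());
let mut stmt = db.prepare("SELECT id, name FROM test WHERE id IN (SELECT value FROM rarray(?));")?;
let mut rows: Vec<(i64, String)> = stmt
    .query_map([values], |row| Ok((row.get(0)?, row.get(1)?)))?
    .collect::<Result<_, _>>()?;
rows.sort();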
How does the performance of this code look?
benches fastest
├─ postgres
│ ├─ query_array 731.3 µs
│ ╰─ query_one 574.8 µs
╰─ sqlite
├─ query_array 9.314 µs
╰─ query_one 7.248 µs
Quite disappointing actually. With both databases, using UNNEST / rarray within the IN operator is actually slower than code-generating a query with a dynamic number of placeholders at runtime.
Well, it's good to know at least, and this definitely was a fun rabbit hole to go into today.