Fun with Arrays in SQL
Today I went down quite a rabbit hole. I was playing around with native array support in SQLite.
So what exactly does this mean, and why should anyone care?
Let's start this off by looking at SQL, which is a somewhat standardized definition and manipulation language for relational databases. Similar to other foundational technologies in the IT sphere, SQL is quite ancient, as it was designed in the 70s. It thus lacks a lot of modern features, like a standardized way of working with arrays, or lists, of data.
The last time I talked about SQL on this blog was more than 5 years ago, when I was experimenting with writing my own strictly typed query builder in TypeScript, inspired by diesel.
Back then, I was already lamenting the lack of array support, in particular in combination with the IN (?) operator.
What this means is that it is not enough to define a single placeholder for a list of values; you have to use N placeholders, one for each value.
This is bad because it pretty much forces you to do code generation at runtime, and you end up with N unique queries. Queries that the SQL server has to parse, plan, optimize and otherwise deal with.
Ideally, you would want to prepare a query once, and then reuse it often. That amortizes some per-query costs and helps performance. If you end up with N unique queries that are rarely reused, you are leaving performance on the table.
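To illustrate, here is a minimal sketch of what that runtime code generation looks like, using rusqlite and a test (id, name) table of the same shape as in the examples below; the query_by_ids helper is made up for this post:
fn query_by_ids(db: &rusqlite::Connection, ids: &[i64]) -> rusqlite::Result<Vec<String>> {
    // One "?" per value: "?, ?, ?, ...", so every distinct list length yields a distinct SQL string.
    let placeholders = vec!["?"; ids.len()].join(", ");
    let sql = format!("SELECT name FROM test WHERE id IN ({placeholders});");
    let mut stmt = db.prepare(&sql)?;
    stmt.query_map(rusqlite::params_from_iter(ids), |row| row.get(0))?
        .collect()
}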
Apart from queries using the IN operator, you would also want to use arrays to optimize bulk inserts, which is what prompted me to research this topic once again. Let us look at an example together.
Inserting into a table is done using the INSERT INTO statement. This is how it might look in Rust for a simple example.
All of the code is available in my sql-array repository if you are curious.
let trx = db.transaction()?; // (1)
let mut stmt = trx.prepare("INSERT INTO test (id, name) VALUES (?, ?);")?; // (2)
for row in rows {
    stmt.execute((row.id, &row.name))?; // (3)
}
stmt.finalize()?;
trx.commit()?; // (4)
Here, I start the bulk insertion by opening a transaction (1), and finish everything by calling commit (4).
The idea is that the bulk insertion either succeeds as a whole, or it fails, in which case nothing will be inserted.
You can see that I prepare the INSERT statement (2), and then execute it once for each row I want to insert (3).
This is very simple, and there are more ways to optimize it, such as reusing the prepared statement across multiple bulk inserts. However, it is quite far from being optimal. Even though the transaction groups all those inserts together, we are still executing one statement per row. So we can hardly call this bulk insertion.
What a lot of database abstractions do instead is insert multiple rows per statement, in which case the statement would look a little bit like this:
INSERT INTO test (id, name) VALUES (?, ?), (?, ?), (?, ?);
… and so on …
You might see the problem already. We have to define each column of each row as a separate placeholder. So we generate an increasing amount of SQL code, which the SQL server has to parse. And then we have to provide a value for each placeholder separately.
There are practical limits to how many placeholders one can define, and there is surely a sweet spot for how many rows you can bulk insert with a single statement.
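To make the batching concrete, here is a minimal sketch of what such a chunked insert could look like with rusqlite. This is an illustration rather than the exact code from my sql-array repository; the Row struct and the insert_chunk helper only exist for this example:
use rusqlite::{params_from_iter, types::Value, Result, Transaction};

struct Row {
    id: i64,
    name: String,
}

// One multi-VALUES statement per chunk: "(?, ?), (?, ?), ..." with a
// placeholder pair per row, bound as one flat list of parameters.
fn insert_chunk(trx: &Transaction<'_>, chunk: &[Row]) -> Result<()> {
    let placeholders = vec!["(?, ?)"; chunk.len()].join(", ");
    let sql = format!("INSERT INTO test (id, name) VALUES {placeholders};");
    // prepare_cached lets the statement be reused whenever a chunk of the
    // same size (and thus the same SQL text) comes around again.
    let mut stmt = trx.prepare_cached(&sql)?;
    let params = params_from_iter(chunk.iter().flat_map(|row| {
        [Value::from(row.id), Value::from(row.name.clone())]
    }));
    stmt.execute(params)?;
    Ok(())
}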
I implemented both of these approaches using sqlite and postgresql, and ran a bunch of benchmarks:
benches fastest
├─ postgres
│ ├─ insert_one
│ │ ├─ 128 3.727 Kitem/s
│ │ ├─ 512 3.564 Kitem/s
│ │ ╰─ 1024 3.436 Kitem/s
│ ├─ insert_batched
│ │ ├─ 128 42.25 Kitem/s
│ │ ├─ 512 78.37 Kitem/s
│ │ ╰─ 1024 104.6 Kitem/s
╰─ sqlite
├─ insert_one
│ ├─ 128 3.339 Mitem/s
│ ├─ 512 3.41 Mitem/s
│ ╰─ 1024 3.437 Mitem/s
├─ insert_batched
│ ├─ 128 2.573 Mitem/s
│ ├─ 512 3.997 Mitem/s
│ ╰─ 1024 4.174 Mitem/s
I was using a batch size of 50 within this code, which means I am inserting in chunks of 50 rows at a time using a statement that is being reused, plus a second statement for the remainder of the elements.
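With the hypothetical insert_chunk helper from the earlier sketch, that chunking boils down to something like this (the last, smaller chunk automatically gets its own, differently sized statement):
for chunk in rows.chunks(50) {
    insert_chunk(&trx, chunk)?;
}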
As you can see, inserting rows one-by-one is pretty much flat in both postgres and sqlite, no matter the total number of rows I want to insert. Batch optimization seems to give huge wins in postgres, and some moderate wins in sqlite, but only with an increasing amount of total rows.
SQLite is also 3 orders of magnitude faster than postgres in this benchmark, primarily because it runs completely in process and in memory, whereas postgres was going through the local network to a docker container.
# Postgres UNNEST
About a year ago, I learned that postgresql has native support for arrays, as well as the UNNEST function. I stumbled across a cheat sheet that describes a bunch of bulk operations that are enabled by UNNEST, including the INSERT statement, as well as the IN operator that I started out with.
Using UNNEST for inserts looks like this:
INSERT INTO test (id, name) SELECT * FROM UNNEST($1::INT[], $2::TEXT[]);
Instead of having to dynamically generate SQL code at runtime, this is a completely static statement with only 2 placeholders. But I can use it to insert however many rows I want.
When you provide more than one placeholder / value to UNNEST, it acts as a transform similar to the struct of arrays pattern, converting a bunch of homogeneous arrays into multiple rows, using each of the arrays as the values for one column. It pretty much turns the struct of arrays into an array of structs.
That is quite amazing. My mind was blown when I learned about it, as I had been searching for functionality like that pretty much ever since I started seriously using SQL.
Let's see how this struct of arrays approach works, and whether it provides any speedups:
let query = "INSERT INTO test (id, name) SELECT * FROM UNNEST($1::INT[], $2::TEXT[]);"; // (1)
let ids: Vec<_> = rows.iter().map(|row| row.id).collect(); // (2)
let names: Vec<_> = rows.iter().map(|row| row.name.as_str()).collect();
trx.execute(query, &[&ids, &names])?; // (3)
As you can see, the query is completely static now (1), and I only execute it a single time (3). However, I have to do the array of structs to struct of arrays transformation manually before invoking the query (2).
But running the benchmarks again, we can see that it indeed provides yet another quite impressive speedup in postgres:
benches fastest
├─ postgres
│ ├─ insert_batched
│ │ ├─ 128 42.25 Kitem/s
│ │ ├─ 512 78.37 Kitem/s
│ │ ╰─ 1024 104.6 Kitem/s
│ ├─ insert_array
│ │ ├─ 128 63.47 Kitem/s
│ │ ├─ 512 206.5 Kitem/s
│ │ ╰─ 1024 321.3 Kitem/s
But, as I hinted at earlier, SQL is only somewhat standardized. The query syntax is roughly 90% the same among different SQL servers / implementations, but they differ wildly in other regards, like the data types they support, or the presence of UNNEST, which I have found to be pretty much exclusive to postgresql.
# SQLite carray
After a ton of searching around on the web, I finally found something similar to UNNEST, which in sqlite comes in the form of the carray extension.
I also found an in-depth article talking about bulk insert performance of sqlite, which also concluded that carray provides the most speedup for bulk inserts.
The Rust rusqlite crate comes with its own re-implementation called rarray.
One fundamental problem with carray / rarray, however, is that it only outputs a single column, and there is no readily available functionality that combines more than one of these columns using the struct of arrays pattern.
The article I linked above uses a CROSS JOIN, which creates a cross product (N*N) of the two arrays, combined with a join condition that limits the output to the matching rows. It looks like this:
INSERT INTO test (id, name) SELECT * FROM rarray(?) AS a CROSS JOIN rarray(?) AS b ON a.rowid = b.rowid;
That is quite a mouthful, and it does not really scale that well to more columns either.
Another downside specific to rusqlite is that rarray is internally an Rc<Vec<Value>>, which means that it does not support borrows, and requires all strings to be cloned like so:
let ids: Vec<_> = rows.iter().map(|row| Value::from(row.id)).collect();
let ids = Array::from(ids);
let names: Vec<_> = rows
    .iter()
    .map(|row| Value::from(row.name.clone()))
    .collect();
let names = Array::from(names);
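For completeness, binding those two arrays to the CROSS JOIN statement could then look roughly like this; this is a sketch that assumes the rarray module has been registered via rusqlite::vtab::array::load_module and that trx is the open transaction, so the real code may differ slightly:
let mut stmt = trx.prepare(
    "INSERT INTO test (id, name) SELECT * FROM rarray(?) AS a CROSS JOIN rarray(?) AS b ON a.rowid = b.rowid;",
)?;
// ids and names are the two Array (Rc<Vec<Value>>) values built above.
stmt.execute((ids, names))?;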
While researching this a little bit, I also found that rusqlite has an inefficiency: it also eagerly clones other strings internally, which is already being tracked as an issue.
So how does the performance look here?
╰─ sqlite
├─ insert_batched
│ ├─ 128 2.573 Mitem/s
│ ├─ 512 3.997 Mitem/s
│ ╰─ 1024 4.174 Mitem/s
├─ insert_array
│ ├─ 128 289.2 Kitem/s
│ ├─ 512 81.15 Kitem/s
│ ╰─ 1024 38.94 Kitem/s
Uhhh, what? This is significantly slower than batch-inserting, and it gets progressively slower the more rows you have.
That is not good. The article I linked above even goes into more detail about why that is. The reason is pretty much that the virtual tables backing the rarray do not support random access by rowid, if I read things right.
That's quite disappointing.
Doing all these experiments, I was digging a bit into the actual code of rarray within rusqlite, and I believe that the virtual table abstractions within sqlite should be powerful enough to fully support an array of structs. So you wouldn't even need the complex CROSS JOIN, and it should also be possible to avoid having to clone things.
I haven't fully understood the exact reasons why rusqlite uses an Rc and requires owned data, but I believe it should be possible to work with borrowed data as well. But that is a topic for another day.
# What about IN?
Last but not least, I also implemented using UNNEST and rarray, respectively, within an IN operator.
This is what it looks like in postgres:
let query = "SELECT id, name from test WHERE id IN (SELECT * FROM UNNEST($1::INT[]));";
let mut rows = vec![];
let mut iter = db.query_raw(query, &[ids])?;
while let Some(raw) = iter.next()? {
    rows.push(row_from_raw(&raw)?);
}
rows.sort();
The complexity in this code comes mostly from converting (and sorting) the results. The query itself is trivially simple.
The rusqlite code looks very similar, except the rarray abstraction requires converting the inputs to Value ahead of time.
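For reference, the sqlite side could be sketched like this, again assuming rarray has been registered via rusqlite::vtab::array::load_module; the ids: Vec<i64> input and the exact shape of the result handling are assumptions for this example:
use std::rc::Rc;
use rusqlite::{types::Value, vtab::array::Array};

// rarray requires owned Values, so convert the ids up front.
let values: Array = Rc::new(ids.iter().map(|&id| Value::from(id)).collect::<Vec<_>>());
let mut stmt = db.prepare("SELECT id, name FROM test WHERE id IN (SELECT value FROM rarray(?));")?;
let mut rows: Vec<(i64, String)> = stmt
    .query_map([values], |row| Ok((row.get(0)?, row.get(1)?)))?
    .collect::<Result<_, _>>()?;
rows.sort();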
How does the performance of this code look?
benches fastest
├─ postgres
│ ├─ query_array 731.3 µs
│ ╰─ query_one 574.8 µs
╰─ sqlite
├─ query_array 9.314 µs
╰─ query_one 7.248 µs
Quite disappointing actually. With both databases, using UNNEST / rarray within the IN operator is actually slower than code-generating a query with a dynamic number of placeholders at runtime.
Well, it's good to know at least, and this definitely was a fun rabbit hole to go into today.