Database Access

A tale of API Design

18 June 2019 — 7 min

I had the pleasure recently of looking at a few js/typescript solutions to database access, mostly focused on SQL.

I was tasked with evaluating TypeORM, which is gaining quite some traction, in comparison to knex, which we currently use.

While knex is more focused on being a generic query-builder, typeorm is a fully featured ORM that can model entities and their relation, but with additional functionality for query building.

Some people may know that I’m a big rust fan, even though I still didn’t manage to write any significant amount of production code in that language. But the broader rust community does a lot of things really well, one of them is diesel, the go-to database access and query builder library. I will also mention some things about diesel and what we can learn from it.

# Type Safety

One of the major problems with knex in particular is that it is inherently untyped. Everything is based on strings, which can either break because of typos, or because of refactoring mistakes. Also the return value of a knex query is any by definition, which is really bad. Engineers will have to manually type the result, which of course is prone to bugs and type mismatches.

This is one of the points that is even embedded in the name of typeorm, but it does not quite deliver on its promise. While yes, when you work with decorated entities, you have strict typing for results and for simple find conditions. But once you dig into the lower level querybuilder, things also become stringly-typed, and all bets are off. Not quite what I had hoped for.

# Code Organization

Another problem we have is that our business domain is very broad, and each feature we need to build needs to hook very deeply into the database. Apart from that, there are also challenges around organizing the code itself. Where to put code, if and when to use query builders, how to make the code more maintainable etc. A lot of these problems are also very related to the API that the database access library provides.

Here, both knex and typeorm are coming up short. What I essentially want, is to define an abstract representation of what I want to query, and then execute it. Well for knex, I need an established connection to be able to use the query builder. For typeorm as well, One only gets an entity manager or repository from an established connection. Things are even worse when it comes to transactions. Also, is the connection or transaction per request scoped? I think it is, in order to offer better consistency guarantees. It would certainly be bad if one query hits the master inside of a transaction and a different query hits a read replica. Good luck debugging the resulting problems.

In contrast, I quite like how in diesel, you can statically define your queries, which then have a execute(connection) method. There, the application author has more control over which queries run on which connection.

This plays a lot nicer together with prepared statements. Essentially, the SQL server has to parse a query, run it through the query optimizer and then run it with some parameters. Wouldn’t it be nice if we could re-use the first two steps over a few executions? Just like how the JS JIT engines can optimize some hot code better if it runs more frequently. I’m no expert in neither SQL servers, nor the libraries I am writing about here. But as far as the JS APIs are structured, it seems like they are building a fresh query string each and every time you call the query builder. Which is bad for the performance of JS code in the first place, but I think the SQL side of things could also be improved with a better structure.

# Performance Pitfalls

One problem we face increasingly with the increasing amounts of data we handle is fetching too much data. Here, all three APIs that I mentioned mostly offer the same methods to access your data, lets call them findOne and findAll. None of the APIs has a streaming interface as first class citizen. Which is quite bad, because you can quite easily DDOS a complete service when one query fetches so much data at once that the complete process runs OOM. While I do love diesel in general, it pre-dates rusts async/await story by quite some time, and only offers synchronous calls, which requires you to manually manage a thread pool. But I hope this will all be solved when async/await in rust becomes stable. Also, async iteration is only supported since node 10, and the js projects are also around for quite a longer time before that.

What I would like essentially is an API that at least has first class support for streaming, or even goes one step further and offers only streaming in its base API. Things like findOne and findAll can be built on top of that.

Also, none of the APIs offer a good solution to pagination, but more on that later.

# Do one thing and do it right.

Something else that kind of bothers me, is that both js projects have a far too broad scope. Yes, typeorm is a fully featured ORM, but still both come with solutions for managing migrations, and additional features that a focused project could serve better.

# Dreaming up an ideal API

So with all these things in mind, lets dream up an API that can do better than what we have right now. First things first, lets focus on one thing only: Creating a type safe query builder, which makes it possible to deliver good performance and flexibility for developers.

Already more than half a year ago, I experimented a bit with creating a diesel-like project in typescript called motorina. (Some people will get the reference :-)

Sadly enough, I haven’t had any time or motivation to continue that effort. Its most important goal was to be completely typesafe, and convenient as well. It should make typos impossible and type mismatches impossible.

A second goal was to be high performance. I wanted to create abstract queries on the toplevel scope, which have a strictly typed set of placeholders and a strictly typed result, even with the possibility to define custom type conversions. Since well you know, mysql does not even have a native boolean type, and it does not map well to typescript enums.

Defining abstract queries in the toplevel scope also decouples queries from connections. I don’t really want to care if a query runs on a read-only replicate, or inside a transaction. This should also in theory play very well with prepared statements. You can cache the query on a per-connection basis, and reuse it a thousand times, with different parameters each time.

One problem here is the sql data model itself, which at least for mysql does not allow to actually pass in an array for a IN(?) placeholder. Meh :-(

Apart from the query builder itself, I think a connection wrapper, which exposes a streaming interface as first class citizen, with maybe helpers to just return a single entity might also be a good idea.

# Excursion: Cursor based pagination

I said I will also briefly talk about cursor based pagination, which is not a first class citizen in sql unfortunately. Page-based pagination with limit and offset is the standard. But that does have some problems when the data underneath changes. Cursor-based pagination itself also has some problems, but at least is a bit more stable. I am currently working on generalizing such a cursor based pagination system, but it seems to be quite a bit more complex than initially thought. But I don’t think a generic query builder should itself provide such functionality, but it should be easy to extend it with such functionality.