
Rust is both uniquely good and bad at handling formats


What do I mean by this clickbait-y title?

# The good

As Rust developers are most likely well aware, the serde ecosystem is one of the strong points of Rust.

You define your data types just like usual, and with a single line of code you can add support for turning them into JSON or any other serialization format, or for parsing them from such a format.

`#[derive(Serialize, Deserialize)]` is a superpower. It is simple, and it is very efficient as well.

That is, if your serialization format is simple, strict, well-defined, and dense.

This might be a bit vague, so let me explain. What I mean is that your format should be as flat as possible, without unnecessary nesting or indirection. Dense in this case means that ideally all the fields are used all the time, and you rarely have to use an `Option`.
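To make that concrete, here is a minimal sketch of such a simple, strict, dense type (the type and its fields are made up for illustration):

```rust
use serde::{Deserialize, Serialize};

/// A flat, dense type: every field is always present,
/// no `Option`s, no unnecessary nesting.
#[derive(Serialize, Deserialize)]
struct Candle {
    timestamp: u64,
    open: f64,
    high: f64,
    low: f64,
    close: f64,
    volume: u64,
}
```

One derive line, and `serde_json::to_string(&candle)` or `serde_json::from_str::<Candle>(input)` just work.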

# The bad

The opposite then would be formats that are overly broad, flexible and sparse.

In this case, sparse means that the format specifies tens or even hundreds of possible fields, but in practice you only ever use one or two of them.

The format may, for backward compatibility or other reasons, define one field either as a single item, as a list, or as a container that wraps a list.
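In serde terms, coping with such a field usually means reaching for something like an untagged enum. A sketch of what I mean (the type and the `items` field are hypothetical):

```rust
use serde::Deserialize;

/// A field that may arrive as a single item, a bare list, or a
/// container wrapping a list. `untagged` makes serde try each
/// variant in order, and every consumer then has to normalize
/// all three shapes.
#[derive(Deserialize)]
#[serde(untagged)]
enum OneOrMany<T> {
    One(T),
    Many(Vec<T>),
    Container { items: Vec<T> },
}
```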

Let's take a look at a couple of examples.

Example one would be the Yahoo Finance API, which is unofficial and undocumented, but quite popular regardless. Getting the OHLCV (open, high, low, close, volume) data for a stock including some corporate actions looks like this: https://query1.finance.yahoo.com/v8/finance/chart/NVD.F?events=div,splits&includePrePost=false&interval=1d&period1=1717192800&period2=1719784800

This gives you a JSON roughly like this:

```json
{
  "chart": {
    "result": [
      {
        "meta": {...},
        "events": {
          "dividends": {
            "1718085600": { "amount": 0.01, "date": 1718085600 }
          },
          "splits": {
            "1717999200": { "date": 1717999200, "numerator": 10.0, "denominator": 1.0, "splitRatio": "10:1" }
          }
        },
        "timestamp": [1717394400, ...],
        "indicators": {
          "quote": [{
            "high": [105.30000305175781, ...],
            "low": [102.80000305175781, ...],
            "open": [102.80000305175781, ...],
            "close": [105.05999755859375, ...],
            "volume": [68560, ...]
          }],
          "adjclose": [{
            "adjclose": [105.03321838378906, ...]
          }]
        }
      }
    ],
    "error": null
  }
}
```

This format looks straightforward, and contains the data and time series we asked for in struct-of-array form, which is nice.

However, the format is also suboptimal for a number of reasons: everything is wrapped in a `chart.result` list that, as far as I can tell, only ever contains a single element; `indicators.quote` and `indicators.adjclose` are likewise single-element lists; and the actual OHLCV arrays are nested two levels deep inside them.

So to reconstruct the time series in struct-of-array form, I have to dig all the way into `chart.result[0].indicators.quote[0]` (plus `chart.result[0].indicators.adjclose[0]` for the adjusted close).

That is a lot of indirection, which is quite easily expressible as a JS-style accessor. Not so much in Rust though.

Expressing that within Rust means that I have to define a struct with a `#[derive(Deserialize)]` for each of the nested objects, and wrap some of those in `Vec`s.

In order to access the time series data, I also have to deconstruct those `Vec`s, which means doing an `if let Some(quote) = quote.into_iter().next()`, because of course a list can also be empty.
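Condensed down to just the parts needed for the quotes, that looks roughly like the following sketch (field types simplified, and the optional fields omitted):

```rust
use serde::Deserialize;

// One struct per level of nesting:
#[derive(Deserialize)]
struct Response {
    chart: Chart,
}

#[derive(Deserialize)]
struct Chart {
    result: Vec<ChartResult>,
}

#[derive(Deserialize)]
struct ChartResult {
    timestamp: Vec<u64>,
    indicators: Indicators,
}

#[derive(Deserialize)]
struct Indicators {
    quote: Vec<Quote>,
}

#[derive(Deserialize)]
struct Quote {
    open: Vec<f64>,
    high: Vec<f64>,
    low: Vec<f64>,
    close: Vec<f64>,
    volume: Vec<u64>,
}

fn extract_quotes(response: Response) -> Option<(Vec<u64>, Quote)> {
    // deconstruct the single-element `Vec`s; a list can also be empty:
    let result = response.chart.result.into_iter().next()?;
    let quote = result.indicators.quote.into_iter().next()?;
    Some((result.timestamp, quote))
}
```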

Mostly, this format has a bunch of indirection that seems unnecessary for my simple use case.


Two other formats I have worked with are the Sentry Event format and, quite recently, SARIF (the Static Analysis Results Interchange Format).

SARIF serves as a very good example of overly broad and sparse. It is designed to accommodate the outputs and features of all the various static analyzers out there. By doing so, it is also most likely bloated and total overkill for any single one of them.

As a side note here, the XKCD about standards does come to mind :-D

Working with these formats is not all that inconvenient though: you either have a Builder, or the ability to use record update syntax, extending from `Default`.
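With record update syntax, that looks something like this (a sketch assuming the `Event` type from sentry-types, which implements `Default`):

```rust
use sentry_types::protocol::v7::Event;

fn make_event() -> Event<'static> {
    // fill in the one field we care about, default the dozens of others:
    Event {
        message: Some("something went wrong".into()),
        ..Default::default()
    }
}
```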

A small inconvenience, but not too bad. One problem that both of these examples have is that the crates implementing them, sentry-types and serde-sarif respectively, don’t make use of the `#[non_exhaustive]` attribute.

This means that extending the format with new fields is a breaking change. It is also one of the reasons why the Sentry crates are at version 0.36, and I don’t really see them hitting 1.0 under these circumstances.

You can either use struct literals and record update syntax, or `#[non_exhaustive]`, but not both, which is a bit of a shame. I wrote about this already almost 5 years ago, btw.
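To make that conflict concrete, a minimal sketch with a hypothetical type:

```rust
// In the crate that defines the format:
#[derive(Default)]
#[non_exhaustive]
pub struct Event {
    pub message: Option<String>,
    // ... many more fields, which may now grow without a semver break
}

// In a downstream crate, this fails to compile (error E0639):
// struct literals, and with them record update syntax, are not
// allowed for `#[non_exhaustive]` structs outside their crate.
let event = Event {
    message: Some("oops".into()),
    ..Default::default()
};
```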

# The ugly

The really ugly part is the hit to compile times, and the binary bloat as well.

As I used to maintain the Sentry Rust SDK, and still kinda do, I have heard quite a few complaints about its compile times. One of the reasons I gave is that the SDK types are too broad and detailed: they define a ton of fields, or variants of some fields, that the Rust SDK itself never produces.

But they still exist, and we still generate serialization code for them. And also deserialization code, which I don’t think is strictly necessary.

That is a ton of highly optimized deserialization code that needs to be compiled, even though we are not using that functionality. We are only using a handful of fields on `Event`, yet the whole struct weighs in at 1232 bytes.

Things are a lot worse for SARIF. The `sarif::Run` struct has 28 fields, of which only one is required. That one field has a level of arguably unnecessary nesting, and contains another struct with yet another 28 fields, of which again only a single one is required. Doing a `mem::size_of::<sarif::Run>()` yields a whopping 7240 bytes.

That is 7240 bytes for the single `tool.driver.name` field that we are actually using. The compiler is generating all this specialized de/serialization code which we are effectively not using.
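For reference, a quick way to reproduce those numbers (a sketch assuming the sentry-types and serde-sarif crates as dependencies):

```rust
use std::mem;

fn main() {
    // prints the 1232 and 7240 quoted above (the exact numbers may
    // vary with crate versions and the target platform):
    println!("{}", mem::size_of::<sentry_types::protocol::v7::Event<'static>>());
    println!("{}", mem::size_of::<serde_sarif::sarif::Run>());
}
```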

# Pay for what you use?

I wonder if it would be possible to somehow only generate the minimum amount of code, for only those things we are actually using.

If the only thing we want to do is serialize things to JSON, we can surely use `serde_json::json!({"some": {"nested": field}})`, which is simpler to type out than a ton of nested structs or builder patterns.

However, that macro still creates a `serde_json::Value` under the hood, which is just a fancy enum with a `BTreeMap` underneath. It does a bunch of allocations and has indirections, and I haven’t measured how much of a difference it makes performance-wise compared to defining a very sparse custom struct. But I bet the compile-time hit will be smaller.
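Side by side, the two approaches would look something like this sketch (with a hypothetical, minimal SARIF-ish payload):

```rust
use serde::Serialize;

// A very sparse set of custom structs with only the fields we actually use:
#[derive(Serialize)]
struct Run<'a> {
    tool: Tool<'a>,
}

#[derive(Serialize)]
struct Tool<'a> {
    driver: Driver<'a>,
}

#[derive(Serialize)]
struct Driver<'a> {
    name: &'a str,
}

fn main() {
    // The `json!` macro: convenient, but it builds a `serde_json::Value`
    // (an enum with allocations and indirections) before serializing:
    let value = serde_json::json!({
        "tool": { "driver": { "name": "my-analyzer" } }
    });

    // The sparse structs: more typing, but they serialize directly,
    // without the intermediate `Value`:
    let run = Run {
        tool: Tool {
            driver: Driver { name: "my-analyzer" },
        },
    };

    // Both produce the same JSON:
    assert_eq!(
        serde_json::to_string(&value).unwrap(),
        serde_json::to_string(&run).unwrap()
    );
}
```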


In the end, Rust (and the serde ecosystem to some extent) is a blessing, as it allows you to work with arbitrary data formats in a very convenient and performant way.

But it is also a curse if you have very complex and sparse formats, of which you only use a small percentage of the fields the format allows for.