Swatinem Blog

Creating my own bespoke binary format

— 15 min

# Don’t do it

Well, right off the bat, I want to repeat something that I read some time ago:

Don’t create your own bespoke binary format! Use JSON!

I agree with this. Having dealt with binary data quite recently, and seeing how some of the formats are neither well documented nor correctly written (some writers actually produce invalid data), I can truly say this is really hard to get right!

And if you want to have a one-off data format for something, the best bet is to just use JSON for it. I would argue that most applications are not so resource constrained that they cannot tolerate the overhead that JSON parsing would incur.

Alas, my use-case has some hard constraints. Also, most of my learnings come from designing my own binary format for an interesting problem I wanted to solve a few months ago. I haven’t published the result of that work yet, but maybe I will create another post about it in the future.

My general design constraints are like this:

For my current project, I want to avoid using JSON for another reason: I have to use C, and since there is no universal package manager for that language, I simply can’t use serde_json and be done with it. Actually importing a JSON serializer seems like such a hassle that I might as well create my own binary format.

^ This makes me think of how many horrible and crappy formats exist for the sole reason that it is way too difficult to actually pull a JSON parser/serializer into a C-based project. Yes, we did that as well in sentry-native, and we did have unsafe and crashing code in it that we found via fuzzing. Exactly the point I wanted to make, lol.

# Interlude: AoS vs SoA

As a small interlude, just tangentially related to the topic at hand, let me quickly explain the difference between Array of Structs vs Struct of Arrays.

The SoA approach is popular in game development and when using Entity-Component-Systems. It is often used to optimize memory access patterns, thus improving performance, and might also improve memory usage by avoiding wasted padding bytes (clownshoes).

struct AosRoot {
    elements: Vec<Aos>,
}
struct Aos {
    a: A,
    b: B,
}

struct SoaRoot {
    a: Vec<A>,
    b: Vec<B>,
}

I think I will come back to this concept later on when talking about tree-shaped data; let’s move on for now.

# Format Compatibility

Apart from not having any kind of format compatibility at all, there are in general two forms of compatibility.

Backwards compatibility means that an up-to-date reader can read data produced by an outdated writer; in other words, I can continue to read old legacy versions of the format. I will mostly focus on this, as I control the reading part and can keep it up-to-date, and this form of compatibility is easy to implement.

Forwards compatibility means that data produced by an up-to-date writer can be read by an outdated reader. This is a lot more challenging, and also means that changes to the format need to be anticipated; otherwise it is impossible to keep the format extensible while still supporting reading a new file with an outdated reader.

# Backwards compatibility via version tagging

The simpler of the two is keeping backwards compatibility with older files in a newer reader. For this, I can simply tag the root of my data format with a version number, and have separate code for reading/interpreting each of the known versions. It is not forwards compatible: when the reader encounters an unknown version, all it can reasonably do is throw a hard error and refuse to parse the format.
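As a minimal sketch of just the dispatch part (names and error type are hypothetical, not part of the format below), the version check at the root could look something like this:

```rust
/// Hypothetical error type for the version dispatch sketch.
#[derive(Debug, PartialEq)]
enum ParseError {
    UnsupportedVersion(u32),
    Truncated,
}

/// Read the version tag at the start of the buffer and reject anything
/// the reader does not know about with a hard error.
fn read_version(buf: &[u8]) -> Result<u32, ParseError> {
    let bytes = buf.get(0..4).ok_or(ParseError::Truncated)?;
    let version = u32::from_le_bytes(bytes.try_into().unwrap());
    match version {
        // Each known version would dispatch to its own dedicated reader here.
        1 | 2 => Ok(version),
        other => Err(ParseError::UnsupportedVersion(other)),
    }
}
```

Everything after the version tag is then free to change between versions, since each version has its own reading code.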

The implementation is quite simple, but you will see a lot of unsafe code:

use core::{mem, ptr};

#[repr(C)]
struct Header {
    version: u32,
    num_a: u32,
    num_b: u32,
}

pub struct Format<'data> {
    buf: &'data [u8],
    header: &'data Header,
}

#[repr(C)]
#[derive(Debug, PartialEq, Eq)]
pub struct A(u32);

#[repr(C)]
#[derive(Debug, PartialEq, Eq)]
pub struct B(u32);

impl<'data> Format<'data> {
    pub fn parse(buf: &'data [u8]) -> Self {
        // TODO:
        // * actually verify the version
        // * ensure the buffer is actually valid
        Format {
            buf,
            header: unsafe { &*(buf.as_ptr() as *const Header) },
        }
    }

    pub fn get_as(&'data self) -> &'data [A] {
        let a_start = unsafe { self.buf.as_ptr().add(mem::size_of::<Header>()) as *const A };
        let a_slice = ptr::slice_from_raw_parts(a_start, self.header.num_a as usize);
        unsafe { &*a_slice }
    }

    pub fn get_bs(&'data self) -> &'data [B] {
        let b_start = unsafe {
            self.buf
                .as_ptr()
                .add(mem::size_of::<Header>())
                .add(mem::size_of::<A>() * self.header.num_a as usize) as *const B
        };
        let b_slice = ptr::slice_from_raw_parts(b_start, self.header.num_b as usize);
        unsafe { &*b_slice }
    }
}

#[test]
fn format_works() {
    let buf = [
        1u32, // version
        1,    // num_a
        2,    // num_b
        3,    // a[0]
        4,    // b[0]
        5,    // b[1]
    ];
    let buf = unsafe {
        &*(ptr::slice_from_raw_parts(buf.as_ptr() as *const u8, buf.len() * mem::size_of::<u32>()))
    };

    let parsed = Format::parse(buf);
    assert_eq!(parsed.get_as(), &[A(3)]);
    assert_eq!(parsed.get_bs(), &[B(4), B(5)]);
}

Like I said, a ton of unsafe. There is a lot of pointer arithmetic involved, but the upside is that we are working entirely with borrowed data, and might as well use the code in a #![no_std] environment.

# Tree-structured Data

I think it is also quite easy to support tree-shaped, or maybe even graph-shaped, data. The previous example showed how to get two types as complete slices. If the first data type includes a start+len tuple referencing its child data, we can simply sub-slice the other data based on that. Let’s change our above example.

use core::{mem, ptr};

#[repr(C)]
struct Header {
    version: u32,
    num_a: u32,
    num_b: u32,
}

#[repr(C)]
pub struct A {
    own_data: u32,
    start_b: u32,
    num_b: u32,
}

#[repr(C)]
#[derive(Debug, PartialEq, Eq)]
pub struct B(u32);

pub struct Format<'data> {
    buf: &'data [u8],
    header: &'data Header,
}

impl<'data> Format<'data> {
    pub fn parse(buf: &'data [u8]) -> Self {
        // TODO:
        // * actually verify the version
        // * ensure the buffer is actually valid
        Format {
            buf,
            header: unsafe { &*(buf.as_ptr() as *const Header) },
        }
    }

    pub fn get_as(&'data self) -> &'data [A] {
        let a_start = unsafe { self.buf.as_ptr().add(mem::size_of::<Header>()) as *const A };
        let a_slice = ptr::slice_from_raw_parts(a_start, self.header.num_a as usize);
        unsafe { &*a_slice }
    }

    fn get_bs_for(&'data self, a: &'data A) -> &'data [B] {
        let b_start = unsafe {
            self.buf
                .as_ptr()
                .add(mem::size_of::<Header>())
                .add(mem::size_of::<A>() * self.header.num_a as usize) as *const B
        };
        let b_slice = ptr::slice_from_raw_parts(b_start, self.header.num_b as usize);
        let range = a.start_b as usize..a.start_b as usize + a.num_b as usize;
        unsafe { (&*b_slice).get_unchecked(range) }
    }
}

#[test]
fn format_works() {
    let buf = [
        1u32, // version
        2,    // num_a
        3,    // num_b
        3,    // a[0].own_data
        0,    // a[0].start_b
        1,    // a[0].num_b
        4,    // a[1].own_data
        1,    // a[1].start_b
        2,    // a[1].num_b
        1,    // a[0].bs[0]
        2,    // a[1].bs[0]
        3,    // a[1].bs[1]
    ];
    let buf = unsafe {
        &*(ptr::slice_from_raw_parts(buf.as_ptr() as *const u8, buf.len() * mem::size_of::<u32>()))
    };

    let parsed = Format::parse(buf);
    let r#as = parsed.get_as();
    assert_eq!(r#as.len(), 2);
    assert_eq!(r#as[0].own_data, 3);
    assert_eq!(parsed.get_bs_for(&r#as[0]), &[B(1)]);
    assert_eq!(r#as[1].own_data, 4);
    assert_eq!(parsed.get_bs_for(&r#as[1]), &[B(2), B(3)]);
}

This did get a little bit more difficult, but not terribly so. There is still a lot of pointer arithmetic and unsafe code in there. One inconvenience is that you need to connect the parser holding the raw buffer to the returned parent data.

# Making things safe

So we have a quite nice format, parsing essentially comes down to pointer arithmetic and a few casts. And everything without allocating. But how can we make it safe? By that I don’t mean to remove all the unsafe blocks, but to rather ensure that whatever happens inside the unsafe blocks is actually safe.

Well, we can make things memory safe by making sure that the pointers are properly aligned, and that we have appropriate padding in between our raw arrays. The next obvious thing to do is to make sure that we don’t read out of bounds. Luckily, Rust makes it super easy to, for example, do checked sub-slicing. We can bounds-check our Header to ensure that the number of embedded elements does not go out of bounds of the buffer. Similarly, all the cross-struct references need to be bounds-checked.
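A minimal sketch of such a checked parse, assuming the same Header layout as above and that both element types are u32-sized as in the examples (the function name is hypothetical):

```rust
use core::mem;

#[repr(C)]
struct Header {
    version: u32,
    num_a: u32,
    num_b: u32,
}

/// A checked `parse`: verify length and alignment before casting,
/// and return `None` instead of reading out of bounds.
fn parse_checked(buf: &[u8]) -> Option<&Header> {
    // The buffer must be large enough to contain the header itself…
    if buf.len() < mem::size_of::<Header>() {
        return None;
    }
    // …and properly aligned for a u32-based struct.
    if buf.as_ptr() as usize % mem::align_of::<Header>() != 0 {
        return None;
    }
    let header = unsafe { &*(buf.as_ptr() as *const Header) };
    // Bounds-check the advertised element counts against the buffer length,
    // assuming both A and B are u32-sized as in the examples above.
    let needed = mem::size_of::<Header>()
        + mem::size_of::<u32>() * header.num_a as usize
        + mem::size_of::<u32>() * header.num_b as usize;
    if buf.len() < needed {
        return None;
    }
    Some(header)
}
```

With this in place, the later slice constructions can rely on the counts having been validated once, up front.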

This, I think, should take care of all the unsafety while accessing the data. The problem remains that we simply cast random bytes: the validity of the actual data we point to depends on the correctness of the writer. CRC-ing the whole buffer (minus version number and CRC-sum itself) might be another idea to at least guard against bitflips, but it does not guard against writer correctness. And we have to remember that we are dealing with user-provided data here, so we must keep in mind that some malicious data will fly our way.

# Disadvantages

We have seen one inconvenience above already. We need access to the raw buffer when we want to access referenced data. There is a simple solution to this, and that is to separate the raw data structures from the public API. The public API would use an iterator that yields stack-allocated copies of our public structs, still #![no_std] compatible, unless the user decides to collect those into a heap allocated Vec.

pub struct PublicA<'data> {
    format: &'data Format<'data>,
    raw_a: &'data A,
}

impl<'data> PublicA<'data> {
    pub fn get_bs(&'data self) -> &'data [B] {
        self.format.get_bs_for(self.raw_a)
    }
}

impl<'data> Format<'data> {
    pub fn iter_a(&'data self) -> impl Iterator<Item = PublicA<'data>> {
        self.get_as().iter().map(move |raw_a| PublicA {
            format: self,
            raw_a,
        })
    }
}

let mut iter_a = parsed.iter_a();
assert_eq!(iter_a.next().unwrap().get_bs(), &[B(1)]);
assert_eq!(iter_a.next().unwrap().get_bs(), &[B(2), B(3)]);

We can have a very ergonomic and still very efficient non-allocating API this way.

One problem remains though. The writer side of this might not look as nice. We cannot directly write the format in one go and stream it into a file. Instead, we have to allocate intermediate A and B vectors, compute relative indices for the child Bs, and then write these different buffers one after the other into a file.
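The writer side could be sketched roughly like this, matching the tree-structured layout from above (the function name and input shape are made up for illustration; it writes little-endian words, matching the native casts the reader does on little-endian machines):

```rust
use std::io::{self, Write};

// Same raw layouts as in the tree-structured example above.
#[repr(C)]
struct A {
    own_data: u32,
    start_b: u32,
    num_b: u32,
}
struct B(u32);

/// Buffer all `A`s and `B`s first, fixing up the relative child indices,
/// then emit the header, the `A` array and the `B` array back-to-back.
fn write_format<W: Write>(mut out: W, input: &[(u32, Vec<u32>)]) -> io::Result<()> {
    let mut a_vec = Vec::new();
    let mut b_vec = Vec::new();
    for (own_data, bs) in input {
        // The child range is an index into the flattened `B` array.
        a_vec.push(A {
            own_data: *own_data,
            start_b: b_vec.len() as u32,
            num_b: bs.len() as u32,
        });
        b_vec.extend(bs.iter().map(|b| B(*b)));
    }
    // Header: version, num_a, num_b.
    for word in [1u32, a_vec.len() as u32, b_vec.len() as u32] {
        out.write_all(&word.to_le_bytes())?;
    }
    for a in &a_vec {
        out.write_all(&a.own_data.to_le_bytes())?;
        out.write_all(&a.start_b.to_le_bytes())?;
        out.write_all(&a.num_b.to_le_bytes())?;
    }
    for b in &b_vec {
        out.write_all(&b.0.to_le_bytes())?;
    }
    Ok(())
}
```

Note that this needs the intermediate vectors precisely because the child indices in each A are only known once all the Bs have been flattened.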

# Forwards compatibility via sizeof tagging

To re-iterate, being forwards compatible means that an outdated reader can still read a newer file format. One trick I found while looking at the minidump format and some of the Windows APIs is that structures are tagged with their own size_of values. So even though an outdated reader cannot interpret the newly added fields, it at least knows how many bytes to skip. Random access still works, though directly returning typed slices would not, I assume. Good thing Rust has Iterators ;-)

One more interesting thing is that some Windows APIs work this way. You call them with the size_of of a stack-allocated out-parameter. The API can then return (well, write into the out-parameter) different versions of a data type, so the same function can be used by applications targeting an older version of the Windows API, while newer applications get extended data.

It works, but it is a pain in the rear to use, at least from C. I think with Rust it might be possible to write ergonomic wrappers for such an API, though I haven’t looked or tried myself.

Let’s go on and try to implement this kind of format:

use core::{mem, ptr};

#[repr(C)]
struct Header {
    version: u32,
    sizeof_header: u32,
    sizeof_a: u32,
    sizeof_b: u32,
    num_a: u32,
    num_b: u32,
}

#[repr(C)]
struct RawA {
    own_data: u32,
    start_b: u32,
    num_b: u32,
}

#[repr(C)]
#[derive(Debug, PartialEq, Eq)]
pub struct B(u32);

pub struct Format<'data> {
    buf: &'data [u8],
    header: &'data Header,
}

impl<'data> Format<'data> {
    pub fn parse(buf: &'data [u8]) -> Self {
        // TODO:
        // * actually verify the version
        // * ensure the buffer is actually valid
        Format {
            buf,
            header: unsafe { &*(buf.as_ptr() as *const Header) },
        }
    }

    fn get_a(&'data self, idx: usize) -> &'data RawA {
        let a_start = unsafe { self.buf.as_ptr().add(self.header.sizeof_header as usize) };
        let a = unsafe { a_start.add(self.header.sizeof_a as usize * idx) };
        unsafe { &*(a as *const RawA) }
    }

    fn get_b(&'data self, idx: usize) -> &'data B {
        let b_start = unsafe {
            self.buf
                .as_ptr()
                .add(self.header.sizeof_header as usize)
                .add(self.header.sizeof_a as usize * self.header.num_a as usize)
        };
        let b = unsafe { b_start.add(self.header.sizeof_b as usize * idx) };
        unsafe { &*(b as *const B) }
    }
}

pub struct PublicA<'data> {
    format: &'data Format<'data>,
    raw_a: &'data RawA,
}

impl<'data> PublicA<'data> {
    pub fn iter_b(&'data self) -> impl Iterator<Item = &'data B> {
        let raw_a = self.raw_a;
        (raw_a.start_b as usize..raw_a.start_b as usize + raw_a.num_b as usize)
            .map(move |idx| self.format.get_b(idx))
    }
}

impl<'data> Format<'data> {
    pub fn iter_a(&'data self) -> impl Iterator<Item = PublicA<'data>> {
        (0..self.header.num_a as usize).map(move |idx| PublicA {
            format: self,
            raw_a: self.get_a(idx),
        })
    }
}

#[test]
fn format_works() {
    let buf = [
        1u32, // version
        28,   // sizeof_header
        16,   // sizeof_a
        8,    // sizeof_b
        2,    // num_a
        3,    // num_b
        123,  // 🤷🏻‍♂️
        3,    // a[0].own_data
        0,    // a[0].start_b
        1,    // a[0].num_b
        123,  // 🤷🏻‍♂️
        4,    // a[1].own_data
        1,    // a[1].start_b
        2,    // a[1].num_b
        123,  // 🤷🏻‍♂️
        1,    // a[0].bs[0]
        123,  // 🤷🏻‍♂️
        2,    // a[1].bs[0]
        123,  // 🤷🏻‍♂️
        3,    // a[1].bs[1]
        123,  // 🤷🏻‍♂️
    ];
    let buf = unsafe {
        &*(ptr::slice_from_raw_parts(buf.as_ptr() as *const u8, buf.len() * mem::size_of::<u32>()))
    };

    let parsed = Format::parse(buf);

    let mut iter_a = parsed.iter_a();
    let a = iter_a.next().unwrap();
    let mut iter_b = a.iter_b();
    assert_eq!(iter_b.next().unwrap(), &B(1));
    let a = iter_a.next().unwrap();
    let mut iter_b = a.iter_b();
    assert_eq!(iter_b.next().unwrap(), &B(2));
    assert_eq!(iter_b.next().unwrap(), &B(3));
}

Well, this was easy enough! Thanks to Rust’s impl Iterator, this also happens to be very ergonomic. One disadvantage is that we can’t use slices directly; we always need to go through the iterators, and allocate and collect any time we want a Vec or a slice.

# Dynamically sized Data

Let me emphasize this very clearly: Don’t do this! I’m serious!

Some data formats don’t use an array of equally-sized objects, but rather dynamically sized objects. Maybe they have string-data embedded, maybe they have a type tag which says how large the structure is. There are two major problems with this.

First, you can’t randomly-access the n-th item, you have to parse them one-by-one.

The second problem is that instead of using indices to cross-reference data, you use byte offsets. To be honest, this is a bit simpler than relying on pointer arithmetic. But it has one critical shortcoming: it relies on the correctness of the writer.
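If you do end up cross-referencing by byte offset, at least validate each offset instead of trusting the writer. A tiny sketch (the function name is made up) that bounds-checks the offset and sidesteps alignment entirely by going through from_le_bytes:

```rust
use core::mem;

/// Read a little-endian `u32` at an arbitrary byte offset, returning `None`
/// for out-of-bounds or overflowing offsets instead of casting blindly.
fn read_u32_at(buf: &[u8], offset: usize) -> Option<u32> {
    // Guard against offset arithmetic overflowing…
    let end = offset.checked_add(mem::size_of::<u32>())?;
    // …and against the range going out of bounds, via checked sub-slicing.
    let bytes = buf.get(offset..end)?;
    // `from_le_bytes` copies out of the slice, so alignment does not matter.
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}
```

This copies the value instead of borrowing it in place, which is exactly the trade-off: it works for any offset the writer emits, aligned or not.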

I recently debugged a problem with the PDB format that uses dynamically-sized data, and relative virtual addressing (RVA) to cross-reference entries. Well, one of these references was wrongly aligned, and reading/casting that reference yielded garbage. While looking at this problem, I found out that all these wrong entries had one thing in common, but I still haven’t found out why they were corrupt, or how to fix them. If anyone is curious, the case I’m referring to is documented here.

Please don’t do this, avoid dynamically-sized data as much as possible.

# Embedded Tree-structured Data

Even worse than having dynamically-sized data at all is to have it for nested tree-structured data.

Remember how dynamically-sized data means we can’t have random access? With tree-structured dynamic data, things get even worse, as we would have to walk and parse a complete sub-tree in order to advance. That is a pain. We could kind-of solve this problem by having separate sizeof_self and sizeof_self_including_subtree sizes, so we can use the latter to skip over the subtree without walking it.
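The two-sizes idea can be sketched with a hypothetical node header (these names and layouts are made up, not from any real format):

```rust
/// Hypothetical dynamically-sized tree node: tagged both with its own size
/// and with the size of itself plus its entire subtree.
struct NodeHeader {
    sizeof_self: u32,
    sizeof_including_subtree: u32,
}

/// Skip to the next sibling without walking the subtree,
/// using the subtree-inclusive size.
fn next_sibling(offset: usize, node: &NodeHeader) -> usize {
    offset + node.sizeof_including_subtree as usize
}

/// Descend into the first child instead, by skipping only the node itself.
fn first_child(offset: usize, node: &NodeHeader) -> usize {
    offset + node.sizeof_self as usize
}
```

Skipping a subtree becomes a single addition, at the cost of storing (and keeping consistent) a second size per node.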

It’s doable, but we are trading storage space for parsing performance. Not to mention that we are already wasting a lot of storage space by version-tagging or sizeof-tagging every single entity.

# Conclusion

I guess we have come to the end, and I am actually super happy about having written down all my thoughts on this topic, and even implemented a few ideas in runnable Rust code (though missing all the bounds-checks).

So in summary: if you can’t use JSON for your data, and absolutely need to use a custom binary format, here are some tips and observations:

And just as a finishing rant: keep your damn software up-to-date! That said, the forwards-compatible version didn’t turn out to be half as bad as I had imagined. Even though I have full control over the reader for my use-case, I can imagine a scenario where we would have to roll back a reader upgrade, and might end up in a situation where the reader is outdated compared to the writer. So I might as well end up using the sizeof-tagging approach.