The magic of AsRef

Both at work and in my spare time, I think a lot about efficient parsers and data formats. Some time ago, I also wrote an article about writing a custom binary format and its associated parser. That exercise started something like this:

#[repr(C)]
struct Header {
    version: u32,
    num_a: u32,
    num_b: u32,
}

pub struct Format<'data> {
    buf: &'data [u8],
    header: &'data Header,
}

impl<'data> Format<'data> {
    pub fn parse(buf: &'data [u8]) -> Self {
        // TODO:
        // * actually verify the version
        // * ensure the buffer is actually valid
        Format {
            buf,
            header: unsafe { &*(buf.as_ptr() as *const Header) },
        }
    }
}

While this works perfectly fine, and the Format is truly zero-copy, it has one major drawback: it carries the lifetime parameter 'data, and is thus not 'static. I can’t capture it in an async move block and tokio::spawn it. Also, for reasons that I must admit I didn’t fully understand at first, boxed trait objects carry a 'static bound by default. Thinking about it again, it becomes a bit more obvious: if I want to package up a callback function into a struct of mine that does not carry a lifetime itself, I have to use a Box<dyn Fn() + 'static> or an equivalent container.
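To make the callback case concrete, here is a minimal sketch. The names Callback and len_callback are made up for illustration and do not come from the original code. It shows that a boxed closure stored in a lifetime-free struct has to own everything it captures:

```rust
// A struct without a lifetime parameter. With no lifetime in scope,
// `Box<dyn Fn() -> usize>` means `Box<dyn Fn() -> usize + 'static>`.
struct Callback {
    f: Box<dyn Fn() -> usize>,
}

impl Callback {
    fn call(&self) -> usize {
        (self.f)()
    }
}

// Hypothetical helper: builds a callback that owns its buffer.
fn len_callback(buf: Vec<u8>) -> Callback {
    // `move` transfers ownership of `buf` into the closure, so the closure
    // borrows nothing and satisfies the `'static` bound.
    Callback {
        f: Box::new(move || buf.len()),
    }
}

fn main() {
    let cb = len_callback(vec![1, 2, 3]);
    assert_eq!(cb.call(), 3);

    // Capturing a local by reference instead is rejected by the compiler,
    // because the closure would no longer be `'static`:
    // let local = vec![1u8, 2, 3];
    // let cb = Callback { f: Box::new(|| local.len()) };
}
```

The same reasoning applies to a borrowed Format<'data>: anything that captures it inherits the 'data lifetime and can no longer be stored in such a struct.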

Either way, for various reasons, we want to have fully “self-owned” types that are 'static, and our example Format above is not self-contained.

There are a couple of different approaches to this, but the go-to solution I have found, which offers the most flexibility to API users, is AsRef<T>, and in our specific case AsRef<[u8]>, so let’s try that.

Without further ado, here is the finished demo code, along with tests that ensure things work as intended, and that our final Format is indeed 'static. We can use any kind of underlying buffer type, no matter if it’s an array, a Vec, a Cow or a memory-mapped file, as long as it implements AsRef<[u8]>.

use core::{mem, ptr};

#[repr(C)]
#[derive(Clone, Copy)]
struct Header {
    version: u32,
    num_a: u32,
    num_b: u32,
}

pub struct Format<Buf> {
    buf: Buf,
    header: Header,
}

#[repr(C)]
#[derive(Debug, PartialEq, Eq)]
pub struct A(u32);

#[repr(C)]
#[derive(Debug, PartialEq, Eq)]
pub struct B(u32);

impl<Buf: AsRef<[u8]>> Format<Buf> {
    pub fn parse(buf: Buf) -> Self {
        // TODO:
        // * actually verify the version
        // * ensure the buffer is actually valid
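        // SAFETY (sketch): assumes `buf` holds at least `size_of::<Header>()` bytes.
        // `read_unaligned` is used because a byte buffer only guarantees 1-byte alignment.
        let header = unsafe { ptr::read_unaligned(buf.as_ref().as_ptr() as *const Header) };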
        Format { buf, header }
    }

    pub fn into_inner(self) -> Buf {
        self.buf
    }

    pub fn get_as(&self) -> &[A] {
        let a_start =
            unsafe { self.buf.as_ref().as_ptr().add(mem::size_of::<Header>()) as *const A };
        let a_slice = ptr::slice_from_raw_parts(a_start, self.header.num_a as usize);
        unsafe { &*a_slice }
    }

    pub fn get_bs(&self) -> &[B] {
        let b_start = unsafe {
            self.buf
                .as_ref()
                .as_ptr()
                .add(mem::size_of::<Header>())
                .add(mem::size_of::<A>() * self.header.num_a as usize) as *const B
        };
        let b_slice = ptr::slice_from_raw_parts(b_start, self.header.num_b as usize);
        unsafe { &*b_slice }
    }
}

#[test]
fn format_works() {
    use std::borrow::Cow;
    fn is_static<T: 'static>(_: &T) {}

    let array_buf: [u8; 24] = [
        // these are all little-endian:
        1, 0, 0, 0, // version
        1, 0, 0, 0, // num_a
        2, 0, 0, 0, // num_b
        3, 0, 0, 0, // a[0]
        4, 0, 0, 0, // b[0]
        5, 0, 0, 0, // b[1]
    ];

    let parsed: Format<[u8; 24]> = Format::parse(array_buf);
    is_static(&parsed);
    assert_eq!(parsed.get_as(), &[A(3)]);
    assert_eq!(parsed.get_bs(), &[B(4), B(5)]);

    let vec_buf = Vec::from(array_buf);
    let parsed: Format<Vec<_>> = Format::parse(vec_buf);
    is_static(&parsed);
    assert_eq!(parsed.get_as(), &[A(3)]);
    assert_eq!(parsed.get_bs(), &[B(4), B(5)]);
    let vec_buf = parsed.into_inner();

    let cow_buf: Cow<'static, [u8]> = Cow::Owned(vec_buf);
    let parsed: Format<Cow<_>> = Format::parse(cow_buf);
    is_static(&parsed);
    assert_eq!(parsed.get_as(), &[A(3)]);
    assert_eq!(parsed.get_bs(), &[B(4), B(5)]);

    let slice_buf: &[u8] = &array_buf;
    let parsed: Format<&[u8]> = Format::parse(slice_buf);

    // is_static(&parsed);
    // ^ this would fail with:
    // error[E0597]: `array_buf` does not live long enough
    //   --> playground/asref/src/lib.rs:89:28
    //    |
    // 89 |     let slice_buf: &[u8] = &array_buf;
    //    |                            ^^^^^^^^^^
    //    |                            |
    //    |                            borrowed value does not live long enough
    //    |                            cast requires that `array_buf` is borrowed for `'static`
    // ...
    // 94 | }
    //    | - `array_buf` dropped here while still borrowed

    assert_eq!(parsed.get_as(), &[A(3)]);
    assert_eq!(parsed.get_bs(), &[B(4), B(5)]);
}
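To show what being 'static buys us in practice, here is a small sketch using std::thread::spawn, which has a 'static requirement much like tokio::spawn. Checksum and sum_in_background are hypothetical stand-ins for Format, chosen so the example stays in safe code:

```rust
use std::borrow::Cow;
use std::thread;

// Hypothetical stand-in for `Format<Buf>`: it owns any buffer that can be
// viewed as bytes, and only uses safe slice operations.
struct Checksum<Buf> {
    buf: Buf,
}

impl<Buf: AsRef<[u8]>> Checksum<Buf> {
    fn new(buf: Buf) -> Self {
        Checksum { buf }
    }

    // Sums all bytes of the underlying buffer.
    fn sum(&self) -> u32 {
        self.buf.as_ref().iter().map(|&b| u32::from(b)).sum()
    }
}

// `Checksum<Vec<u8>>` has no lifetime parameter, so it can be moved into a
// spawned thread, whose closure must be `'static`.
fn sum_in_background(c: Checksum<Vec<u8>>) -> u32 {
    thread::spawn(move || c.sum()).join().unwrap()
}

fn main() {
    let bytes = [1u8, 2, 3, 4];

    // The same type works with an array, a borrowed slice, and a `Cow`.
    assert_eq!(Checksum::new(bytes).sum(), 10);
    assert_eq!(Checksum::new(&bytes[..]).sum(), 10);
    assert_eq!(Checksum::new(Cow::Borrowed(&bytes[..])).sum(), 10);

    // Only the owned variant can cross the thread boundary here.
    assert_eq!(sum_in_background(Checksum::new(bytes.to_vec())), 10);
}
```

Passing a Checksum<&[u8]> that borrows a local to sum_in_background would not type-check, which mirrors the commented-out is_static call in the test above.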

The one shortcoming of this approach, though, is that it is not fully zero-copy anymore: the parse() method copies the header bytes. To avoid that, we would need better (and safe) ways to declare self-referential structs. But that is a topic for another post ;-)
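In the meantime, one possible workaround is to not store the header at all and decode its fields from the buffer on every access. This is only a sketch under assumptions: LazyFormat, HEADER_SIZE and read_u32 are names I made up, and it trades the one-time copy for a small decode on each call.

```rust
// Assumed header layout: three little-endian u32 fields (12 bytes).
const HEADER_SIZE: usize = 12;

// Hypothetical variant of `Format` that stores only the buffer, so nothing
// is copied out of it at parse time.
pub struct LazyFormat<Buf> {
    buf: Buf,
}

impl<Buf: AsRef<[u8]>> LazyFormat<Buf> {
    pub fn parse(buf: Buf) -> Option<Self> {
        // Only check that the header fits; version checks are left out.
        if buf.as_ref().len() < HEADER_SIZE {
            return None;
        }
        Some(LazyFormat { buf })
    }

    // Decodes the little-endian u32 starting at `offset`.
    fn read_u32(&self, offset: usize) -> u32 {
        let b = &self.buf.as_ref()[offset..offset + 4];
        u32::from_le_bytes([b[0], b[1], b[2], b[3]])
    }

    pub fn version(&self) -> u32 {
        self.read_u32(0)
    }

    pub fn num_a(&self) -> u32 {
        self.read_u32(4)
    }

    pub fn num_b(&self) -> u32 {
        self.read_u32(8)
    }
}

fn main() {
    let buf: Vec<u8> = vec![1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0];
    let format = LazyFormat::parse(buf).expect("buffer holds a header");
    assert_eq!(format.version(), 1);
    assert_eq!(format.num_a(), 1);
    assert_eq!(format.num_b(), 2);
}
```

This stays in safe code and sidesteps alignment questions, but it is a different trade-off rather than a real answer to the self-referential struct problem.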