Swatinem Blog Resume

Self-referential structs and alternatives

— 5 min

Today I want to talk about the need for better tools to work with self-referential structs, and also (safe) alternatives that alleviate that need for certain use-cases.

Lets start by giving a very concrete example of what I want to achieve. I want to get the nth line of a string quickly. The idiomatic way to do it would be string.lines().nth(nth), and it can hardly get any simpler than that. The problem with it is that it is not really quick. It is an O(n) operation, as it has to walk the string from beginning up to the end in the worst case when there is no newline at all. We don’t want to be doing that in a tight loop.

We can trade some memory usage for speed by doing the line splitting once and caching the result. With that, getting the nth line becomes a O(1) operation.

This could be as simple as string.lines().collect::<Vec<_>>(), or as a complete example which is literally a piece a code we use in production:

pub struct BorrowedCachedLines<'data> {
    lines: Vec<&'data str>,
}

impl<'data> BorrowedCachedLines<'data> {
    pub fn new(string: &'data str) -> Self {
        Self {
            lines: string.lines().collect(),
        }
    }

    pub fn nth(&self, line: usize) -> Option<&str> {
        self.lines.get(line).copied()
    }
}

The problem with that piece of code is that it has a lifetime, and is thus not self-contained, and we thus can’t capture it in an async move future that we want to spawn.

# Unsafe self-references

As I have shown last time, AsRef is magic and we can use it to create self-contained structs. However, our Vec<&'data str> has to have some lifetime. The closest one to “no lifetime” would be the 'static lifetime, which is special as we can put a 'static reference into a self-contained struct.

Lets try using AsRef<str> and combine it with a Vec<&'static str>. This clearly won’t work, but lets try it anyway to see if the compiler helps us out.

pub struct AsRefUnsafeCachedLines<Buf> {
    buf: Buf,
    lines: Vec<&'static str>,
}

impl<Buf: AsRef<str>> AsRefUnsafeCachedLines<Buf> {
    pub fn new(buf: Buf) -> Self {
        let lines = buf.as_ref().lines().collect();
        Self { buf, lines }
    }

    pub fn nth(&self, line: usize) -> Option<&str> {
        self.lines.get(line).copied()
    }
}

This will give us the following compiler error, which I must admit is confusing and especially the “help” annotation does not help at all.

error[E0310]: the parameter type `Buf` may not live long enough
  --> playground\self-referential\src\lib.rs:24:21
   |
22 | impl<Buf: AsRef<str>> AsRefUnsafeCachedLines<Buf> {
   |      ---- help: consider adding an explicit lifetime bound...: `Buf: 'static +`
23 |     pub fn new(buf: Buf) -> Self {
24 |         let lines = buf.as_ref().lines().collect();
   |                     ^^^ ...so that the type `Buf` is not borrowed for too long

Lets try the “help” anyway and see how that changes things:

error[E0597]: `buf` does not live long enough
  --> playground\self-referential\src\lib.rs:24:21
   |
24 |         let lines = buf.as_ref().lines().collect();
   |                     ^^^^^^^^^^^^
   |                     |
   |                     borrowed value does not live long enough
   |                     argument requires that `buf` is borrowed for `'static`
25 |         Self { buf, lines }
26 |     }
   |     - `buf` dropped here while still borrowed

error[E0505]: cannot move out of `buf` because it is borrowed
  --> playground\self-referential\src\lib.rs:25:16
   |
24 |         let lines = buf.as_ref().lines().collect();
   |                     ------------
   |                     |
   |                     borrow of `buf` occurs here
   |                     argument requires that `buf` is borrowed for `'static`
25 |         Self { buf, lines }
   |                ^^^ move out of `buf` occurs here

While having two error messages here does not make too much sense either, at least they are better than than before and hint at the problem: Our lines has a lifetime that is tied to buf, but it is required to have a 'static lifetime.

What we essentially want to do is to turn our &'data str into a &'static str, which for obvious reasons is not allowed and we need to resort to unsafe code.

We can essentially create a &'static str “out of thin air” by using from_raw_parts:

pub struct AsRefUnsafeCachedLines<Buf> {
    buf: Buf,
    lines: Vec<&'static str>,
}

impl<Buf: AsRef<str>> AsRefUnsafeCachedLines<Buf> {
    pub fn new(buf: Buf) -> Self {
        let lines = buf
            .as_ref()
            .lines()
            .map(|s| unsafe { make_static(s) })
            .collect();
        Self { buf, lines }
    }

    pub fn nth(&self, line: usize) -> Option<&str> {
        self.lines.get(line).copied()
    }
}

unsafe fn make_static(s: &str) -> &'static str {
    let ptr = s.as_ptr();
    let len = s.len();
    let static_slice: &'static [u8] = std::slice::from_raw_parts(ptr, len);
    std::str::from_utf8_unchecked(static_slice)
}

This example works, but the compiler says something very interesting:

warning: field is never read: `buf`
  --> playground\self-referential\src\lib.rs:18:5
   |
18 |     buf: Buf,
   |     ^^^^^^^^
   |
   = note: `#[warn(dead_code)]` on by default

It is completely right. We never use buf directly. It only exists to, well, exist.

As a reminder, this code is extremely unsafe! While we are dealing with &str in our example, remember that a type such as [u8; N] implements AsRef<[u8]>. I leave it as an exercise for the reader to reproduce the case that moving a stack allocated [u8; N] will make you read from a dangling pointer.

Having something along the lines of StableAsRef<T> would help here. By which I mean any type that guarantees that the reference returned by as_ref() does not change whenever the type itself is moved. Box<[T]> and Vec<T> would implement this, but not [T; N]. A trait with such a guarantee would make it possible to write a safe abstraction provided that code behind the abstraction upholds the safety invariants (as in: not moving/modifying the data in the buffer).

# (Almost) Safe alternatives

The lifetime issues and the need for unsafe code to work around them come from the fact that we are caching Rust references (aka pointers). What if we cache sub-slice offsets into our source buffer instead? Lets try that.

pub struct SafeCachedLines<Buf> {
    buf: Buf,
    line_offsets: Vec<std::ops::Range<usize>>,
}

impl<Buf: AsRef<str>> SafeCachedLines<Buf> {
    pub fn new(buf: Buf) -> Self {
        let as_ref = buf.as_ref();
        let start_ptr = as_ref.as_ptr();
        let line_offsets = as_ref
            .lines()
            .map(|s| {
                let start = unsafe { s.as_ptr().offset_from(start_ptr) as usize };
                start..start + s.len()
            })
            .collect();
        Self { buf, line_offsets }
    }

    pub fn nth(&self, line: usize) -> Option<&str> {
        let range = self.line_offsets.get(line)?.clone();
        self.buf.as_ref().get(range)
    }
}

NOTE: While working on this example, I was a bit surprised there is no safe function in the standard library for “give me the Range of a sub-string if it is contained within self”. I’m also surprised that offset_from is unsafe, but reading the documentation though makes it clear why.

Anyhow, the added safety means that we have to do an additional slice operation on access with the added overhead of a bounds check.

# Optimizing offsets

Depending on the tradeoffs we want to make, we can even go one step further and try to optimize our lookup index a bit. A full Range<usize> has 16 bytes on 64-bit platforms. We can get that down to 4 bytes if we assume our input buffer is smaller than 4 GiB, which, if we are dealing in plain text is a lot.

The tradeoff here however is some added error handling when constructing the cache, and a few more bounds checks and stripping trailing line terminators when accessing. Here is the full example:

pub struct CompressedCachedLines<Buf> {
    buf: Buf,
    line_offsets: Vec<u32>,
}

impl<Buf: AsRef<str>> CompressedCachedLines<Buf> {
    pub fn new(buf: Buf) -> Result<Self, std::num::TryFromIntError> {
        let as_ref = buf.as_ref();
        let start_ptr = as_ref.as_ptr();
        let line_offsets = as_ref
            .lines()
            .map(|s| unsafe { s.as_ptr().offset_from(start_ptr).try_into() })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Self { buf, line_offsets })
    }

    pub fn nth(&self, line: usize) -> Option<&str> {
        let buf = self.buf.as_ref();
        let start = self.line_offsets.get(line).copied()? as usize;
        let end = self
            .line_offsets
            .get(line.checked_add(1)?)
            .map(|offset| *offset as usize)
            .unwrap_or(buf.len());

        let line_buf = buf.get(start..end)?;
        let line_buf = line_buf.strip_suffix("\r\n").unwrap_or(line_buf);
        let line_buf = line_buf.strip_suffix('\n').unwrap_or(line_buf);
        Some(line_buf)
    }
}

#[test]
fn compressed() {
    let string = String::from("some\r\ntrailing\r\r\nlines");
    let cached_lines = CompressedCachedLines::new(string).unwrap();
    assert_eq!(cached_lines.nth(0), Some("some"));
    assert_eq!(cached_lines.nth(1), Some("trailing\r"));
    assert_eq!(cached_lines.nth(2), Some("lines"));
}

Having such a list of line offsets allows us to not only get the nth line with O(1) time complexity (just a constant number of bounds checks, and some constant-time strip_suffix), it also allows us a reverse lookup from byte offset to line number by doing a O(log N) binary search bounded by the number of lines we have, which is an added benefit.

# Conclusion

The last example is very close to my real world use-case. I essentially want to mmap such indexed files directly from disk. The mentioned bounds checks are very necessary in that case since I would be dealing with untrusted, possibly malicious data in that case.