Self-referential structs and alternatives
— 8 minToday I want to talk about the need for better tools to work with self-referential structs, and also (safe) alternatives that alleviate that need for certain use-cases.
Lets start by giving a very concrete example of what I want to achieve. I want
to get the nth
line of a string quickly. The idiomatic way to do it would
be string.lines().nth(nth)
, and it can hardly get any simpler than that. The
problem with it is that it is not really quick. It is an O(n)
operation, as
it has to walk the string from beginning up to the end in the worst case when
there is no newline at all. We don’t want to be doing that in a tight loop.
We can trade some memory usage for speed by doing the line splitting once and
caching the result. With that, getting the nth
line becomes a O(1)
operation.
This could be as simple as string.lines().collect::<Vec<_>>()
, or as a complete
example which is literally a piece a code we use in production:
pub struct BorrowedCachedLines<'data> {
lines: Vec<&'data str>,
}
impl<'data> BorrowedCachedLines<'data> {
pub fn new(string: &'data str) -> Self {
Self {
lines: string.lines().collect(),
}
}
pub fn nth(&self, line: usize) -> Option<&str> {
self.lines.get(line).copied()
}
}
The problem with that piece of code is that it has a lifetime, and is thus not
self-contained, and we thus can’t capture it in an async move
future that we
want to spawn.
# Unsafe self-references
As I have shown last time, AsRef
is magic
and we can use it to create self-contained structs. However, our Vec<&'data str>
has to have some lifetime. The closest one to “no lifetime” would be the
'static
lifetime, which is special as we can put a 'static
reference into
a self-contained struct.
Lets try using AsRef<str>
and combine it with a Vec<&'static str>
. This
clearly won’t work, but lets try it anyway to see if the compiler helps us out.
pub struct AsRefUnsafeCachedLines<Buf> {
buf: Buf,
lines: Vec<&'static str>,
}
impl<Buf: AsRef<str>> AsRefUnsafeCachedLines<Buf> {
pub fn new(buf: Buf) -> Self {
let lines = buf.as_ref().lines().collect();
Self { buf, lines }
}
pub fn nth(&self, line: usize) -> Option<&str> {
self.lines.get(line).copied()
}
}
This will give us the following compiler error, which I must admit is confusing and especially the “help” annotation does not help at all.
error[E0310]: the parameter type `Buf` may not live long enough
--> playground\self-referential\src\lib.rs:24:21
|
22 | impl<Buf: AsRef<str>> AsRefUnsafeCachedLines<Buf> {
| ---- help: consider adding an explicit lifetime bound...: `Buf: 'static +`
23 | pub fn new(buf: Buf) -> Self {
24 | let lines = buf.as_ref().lines().collect();
| ^^^ ...so that the type `Buf` is not borrowed for too long
Lets try the “help” anyway and see how that changes things:
error[E0597]: `buf` does not live long enough
--> playground\self-referential\src\lib.rs:24:21
|
24 | let lines = buf.as_ref().lines().collect();
| ^^^^^^^^^^^^
| |
| borrowed value does not live long enough
| argument requires that `buf` is borrowed for `'static`
25 | Self { buf, lines }
26 | }
| - `buf` dropped here while still borrowed
error[E0505]: cannot move out of `buf` because it is borrowed
--> playground\self-referential\src\lib.rs:25:16
|
24 | let lines = buf.as_ref().lines().collect();
| ------------
| |
| borrow of `buf` occurs here
| argument requires that `buf` is borrowed for `'static`
25 | Self { buf, lines }
| ^^^ move out of `buf` occurs here
While having two error messages here does not make too much sense either, at
least they are better than than before and hint at the problem: Our lines
has a lifetime that is tied to buf
, but it is required to have a 'static
lifetime.
What we essentially want to do is to turn our &'data str
into a &'static str
,
which for obvious reasons is not allowed and we need to resort to unsafe code.
We can essentially create a &'static str
“out of thin air” by using from_raw_parts
:
pub struct AsRefUnsafeCachedLines<Buf> {
buf: Buf,
lines: Vec<&'static str>,
}
impl<Buf: AsRef<str>> AsRefUnsafeCachedLines<Buf> {
pub fn new(buf: Buf) -> Self {
let lines = buf
.as_ref()
.lines()
.map(|s| unsafe { make_static(s) })
.collect();
Self { buf, lines }
}
pub fn nth(&self, line: usize) -> Option<&str> {
self.lines.get(line).copied()
}
}
unsafe fn make_static(s: &str) -> &'static str {
let ptr = s.as_ptr();
let len = s.len();
let static_slice: &'static [u8] = std::slice::from_raw_parts(ptr, len);
std::str::from_utf8_unchecked(static_slice)
}
This example works, but the compiler says something very interesting:
warning: field is never read: `buf`
--> playground\self-referential\src\lib.rs:18:5
|
18 | buf: Buf,
| ^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
It is completely right. We never use buf
directly. It only exists to, well,
exist.
As a reminder, this code is extremely unsafe! While we are dealing with &str
in our example, remember that a type such as [u8; N]
implements AsRef<[u8]>
.
I leave it as an exercise for the reader to reproduce the case that moving a
stack allocated [u8; N]
will make you read from a dangling pointer.
Having something along the lines of StableAsRef<T>
would help here. By which
I mean any type that guarantees that the reference returned by as_ref()
does
not change whenever the type itself is moved. Box<[T]>
and Vec<T>
would
implement this, but not [T; N]
. A trait with such a guarantee would make it
possible to write a safe abstraction provided that code behind the abstraction
upholds the safety invariants (as in: not moving/modifying the data in the buffer).
# (Almost) Safe alternatives
The lifetime issues and the need for unsafe code to work around them come from the fact that we are caching Rust references (aka pointers). What if we cache sub-slice offsets into our source buffer instead? Lets try that.
pub struct SafeCachedLines<Buf> {
buf: Buf,
line_offsets: Vec<std::ops::Range<usize>>,
}
impl<Buf: AsRef<str>> SafeCachedLines<Buf> {
pub fn new(buf: Buf) -> Self {
let as_ref = buf.as_ref();
let start_ptr = as_ref.as_ptr();
let line_offsets = as_ref
.lines()
.map(|s| {
let start = unsafe { s.as_ptr().offset_from(start_ptr) as usize };
start..start + s.len()
})
.collect();
Self { buf, line_offsets }
}
pub fn nth(&self, line: usize) -> Option<&str> {
let range = self.line_offsets.get(line)?.clone();
self.buf.as_ref().get(range)
}
}
NOTE: While working on this example, I was a bit surprised there is no safe
function in the standard library for “give me the Range
of a sub-string if it
is contained within self
”. I’m also surprised that offset_from
is unsafe
,
but reading the documentation though makes it clear why.
Anyhow, the added safety means that we have to do an additional slice operation on access with the added overhead of a bounds check.
# Optimizing offsets
Depending on the tradeoffs we want to make, we can even go one step further and
try to optimize our lookup index a bit. A full Range<usize>
has 16 bytes on
64-bit platforms. We can get that down to 4 bytes if we assume our input
buffer is smaller than 4 GiB, which, if we are dealing in plain text is a lot.
The tradeoff here however is some added error handling when constructing the cache, and a few more bounds checks and stripping trailing line terminators when accessing. Here is the full example:
pub struct CompressedCachedLines<Buf> {
buf: Buf,
line_offsets: Vec<u32>,
}
impl<Buf: AsRef<str>> CompressedCachedLines<Buf> {
pub fn new(buf: Buf) -> Result<Self, std::num::TryFromIntError> {
let as_ref = buf.as_ref();
let start_ptr = as_ref.as_ptr();
let line_offsets = as_ref
.lines()
.map(|s| unsafe { s.as_ptr().offset_from(start_ptr).try_into() })
.collect::<Result<Vec<_>, _>>()?;
Ok(Self { buf, line_offsets })
}
pub fn nth(&self, line: usize) -> Option<&str> {
let buf = self.buf.as_ref();
let start = self.line_offsets.get(line).copied()? as usize;
let end = self
.line_offsets
.get(line.checked_add(1)?)
.map(|offset| *offset as usize)
.unwrap_or(buf.len());
let line_buf = buf.get(start..end)?;
let line_buf = line_buf.strip_suffix("\r\n").unwrap_or(line_buf);
let line_buf = line_buf.strip_suffix('\n').unwrap_or(line_buf);
Some(line_buf)
}
}
#[test]
fn compressed() {
let string = String::from("some\r\ntrailing\r\r\nlines");
let cached_lines = CompressedCachedLines::new(string).unwrap();
assert_eq!(cached_lines.nth(0), Some("some"));
assert_eq!(cached_lines.nth(1), Some("trailing\r"));
assert_eq!(cached_lines.nth(2), Some("lines"));
}
Having such a list of line offsets allows us to not only get the nth
line with
O(1)
time complexity (just a constant number of bounds checks, and some
constant-time strip_suffix
), it also allows us a reverse lookup from byte
offset to line number by doing a O(log N)
binary search bounded by the number
of lines we have, which is an added benefit.
# Conclusion
The last example is very close to my real world use-case. I essentially want to
mmap
such indexed files directly from disk. The mentioned bounds checks are
very necessary in that case since I would be dealing with untrusted, possibly
malicious data in that case.