Optimizing Rust Enum `Debug`-ing with Perfect Hashing
This weekend is the start of my week of chill-at-home-and-do-nothing vacation. Which, for a passionate software engineer, is the perfect time to do some open source work and dive into interesting topics outside of work.
The topic I will be looking at is optimizing code generation of Rust enums. This deep dive is motivated by a real-world issue in the Rust minidump(_common) crate.
The minidump_common
crate defines two gigantic C-style enums that look a little bit like this:
#[repr(u32)]
#[derive(Copy, Clone, PartialEq, Eq, Debug, FromPrimitive)]
pub enum WinErrorWindows {
    ERROR_SUCCESS = 0,
    ERROR_INVALID_FUNCTION = 1,
    ERROR_FILE_NOT_FOUND = 2,
    ERROR_PATH_NOT_FOUND = 3,
    // ... about ~2_000 more variants
}
It is a data-less enum with an explicit u32 discriminant type; the discriminant values are explicitly assigned, and they are non-contiguous, meaning they have gaps in them.
The problem with this enum is that the Debug
and FromPrimitive
derives create a ton of bloat.
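To get a feel for where that bloat comes from, here is roughly what the FromPrimitive derive has to expand to for an enum like this (a simplified sketch of my own, not the exact num_derive output):

// Simplified sketch, not the exact derive expansion.
impl num_traits::FromPrimitive for WinErrorWindows {
    fn from_u64(n: u64) -> Option<Self> {
        // One match arm per variant, so the function grows linearly with the enum.
        match n {
            0 => Some(WinErrorWindows::ERROR_SUCCESS),
            1 => Some(WinErrorWindows::ERROR_INVALID_FUNCTION),
            2 => Some(WinErrorWindows::ERROR_FILE_NOT_FOUND),
            3 => Some(WinErrorWindows::ERROR_PATH_NOT_FOUND),
            // ... about ~2_000 more arms
            _ => None,
        }
    }

    fn from_i64(n: i64) -> Option<Self> {
        Self::from_u64(n as u64)
    }
}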
How much, you ask? Well, according to cargo bloat, these two derives are among the top offenders when compiling symbolicator, and symbolicator is a huge crate:
File .text Size Crate Name
0.3% 0.5% 108.8KiB minidump_common <minidump_common::errors::windows::WinErrorWindows as core::fmt::Debug>::fmt
0.2% 0.3% 62.0KiB minidump_common <minidump_common::errors::windows::WinErrorWindows as num_traits::cast::FromPrimitive>::from_u64
Another indicator of this bloat is cargo llvm-lines
, which has the following to say about our enum:
Lines Copies Function name
----- ------ -------------
9852153 271193 (TOTAL)
73972 (0.8%, 1.5%) 6184 (2.3%, 3.4%) <&T as core::fmt::Debug>::fmt
2832 (0.0%, 41.3%) 1 (0.0%, 25.6%) <minidump_common::errors::windows::WinErrorWindows as core::fmt::Debug>::fmt
2830 (0.0%, 41.4%) 1 (0.0%, 25.6%) <minidump_common::errors::windows::WinErrorWindows as num_traits::cast::FromPrimitive>::from_u64
The cargo llvm-lines output does not flag it as such a huge offender, and at ~2000 lines, there are barely any improvements to be had, as that is fairly close to the number of variants this enum has.
# Debug-ing enum codegen
Let's break this whole problem down and start from the beginning with a dead-simple enum.
#[derive(Debug)]
pub enum Enum {
    A,
    B,
    C,
    D,
}

pub fn fmt(e: Enum) -> String {
    format!("{e:?}")
}
Nothing out of the ordinary, and we can use this example to illustrate a couple of things.
First, let's see what kind of Rust code is being generated on behalf of that #[derive(Debug)].
We can do that directly on the Rust Playground: just choose "Expand macros" from the "Tools" dropdown. This is what it looks like:
pub enum Enum { A, B, C, D, }
#[automatically_derived]
impl ::core::fmt::Debug for Enum {
    fn fmt(&self, f: &mut ::core::fmt::Formatter) -> ::core::fmt::Result {
        ::core::fmt::Formatter::write_str(f,
            match self {
                Enum::A => "A",
                Enum::B => "B",
                Enum::C => "C",
                Enum::D => "D",
            })
    }
}
Very straightforward, no surprises here.
When this is compiled into native code by LLVM, checking the output in the Compiler Explorer gave me my first positive shock, as this is its output:
<example::Enum as core::fmt::Debug>::fmt:
mov rax, rsi
movzx ecx, byte ptr [rdi]
lea rdx, [rip + .Lreltable.<example::Enum as core::fmt::Debug>::fmt]
movsxd rsi, dword ptr [rdx + 4*rcx]
add rsi, rdx
mov edx, 1
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.Lreltable.<example::Enum as core::fmt::Debug>::fmt:
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_4-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_5-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_6-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.L__unnamed_3:
.byte 65
.L__unnamed_4:
.byte 66
.L__unnamed_5:
.byte 67
.L__unnamed_6:
.byte 68
LLVM is smart enough to turn all of this into a bunch of labeled bytes representing our single-character enum variant names,
and a table that references them. As the compiler knows that all our variant names are just a single character,
it will hardcode the value 1
in there.
To be perfectly frank, the compiler could do even better in this case ;-) If all our names are just single characters, and we have dense discriminants, we wouldn't need a table at all; we could just index directly into a flat string, something like the sketch below.
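This is purely hypothetical on my part, not something the compiler does today:

// Hypothetical "flat string" encoding: every name is exactly one byte and the
// discriminants are dense, so the name is a one-byte slice at offset `discriminant`.
const NAMES: &str = "ABCD";

pub fn variant_name(e: Enum) -> &'static str {
    let i = e as usize;
    &NAMES[i..i + 1]
}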
But okay, so far I am already impressed. Let's build on this example and make the variant names variable-length:
#[derive(Debug)]
pub enum Enum {
    A,
    BB,
    CCC,
    ABCD,
}
And the assembly output is now this:
<example::Enum as core::fmt::Debug>::fmt:
mov rax, rsi
movzx ecx, byte ptr [rdi]
lea rdx, [rip + .Lswitch.table.<example::Enum as core::fmt::Debug>::fmt]
mov rdx, qword ptr [rdx + 8*rcx]
lea rdi, [rip + .Lreltable.<example::Enum as core::fmt::Debug>::fmt]
movsxd rsi, dword ptr [rdi + 4*rcx]
add rsi, rdi
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.Lswitch.table.<example::Enum as core::fmt::Debug>::fmt:
.quad 1
.quad 2
.quad 3
.quad 4
[…]
.Lreltable.<example::Enum as core::fmt::Debug>::fmt:
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_4-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_5-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_6-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
[…]
.L__unnamed_3:
.byte 65
.L__unnamed_4:
.zero 2,66
.L__unnamed_5:
.zero 3,67
.L__unnamed_6:
.ascii "ABCD"
As we can see, we now end up with two tables: one holding the lengths of the names, and the second, as before, holding pointers to the raw bytes. It is also interesting that the compiler picks different assembler directives for the string data depending on whether the letters repeat and how long the string is.
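In Rust terms, the two tables together are essentially one &'static str (a pointer plus a length) per variant, indexed by the discriminant; LLVM just splits that into two parallel arrays. Conceptually, it boils down to something like this sketch (not actual compiler output):

// One name per variant, indexed directly by the discriminant.
// The compiler stores the pointers and lengths in two separate tables,
// but the information is the same.
static NAMES: [&str; 4] = ["A", "BB", "CCC", "ABCD"];

pub fn variant_name(e: Enum) -> &'static str {
    NAMES[e as usize]
}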
Our next experiment is making this a sparse enum by explicitly assigning discriminant values.
#[derive(Debug)]
pub enum Enum {
    A = 0,
    BB = 4,
    CCC,
    ABCD,
}
Again, a positive surprise from LLVM, as it will just insert duplicated entries into the tables, trading a bit of wasted space for efficient code:
.Lswitch.table.<example::Enum as core::fmt::Debug>::fmt:
.quad 1
.quad 1
.quad 1
.quad 1
.quad 2
.quad 3
.quad 4
.Lreltable.<example::Enum as core::fmt::Debug>::fmt:
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_4-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_5-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_6-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
This optimization only triggers up to a certain threshold of course. When I insert a gap of ~200 between the discriminants, the generated code gets vastly worse.
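Concretely, the enum for this experiment looks like this (it is the same sparse enum, with discriminants [0, 200, 201, 202], that the rest of this post keeps using):

// Same enum as before, with BB bumped to 200 to create a large gap.
#[derive(Debug)]
pub enum Enum {
    A = 0,
    BB = 200,
    CCC,
    ABCD,
}

With that gap in place, we end up with a chain of comparisons and conditional jumps: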
<example::Enum as core::fmt::Debug>::fmt:
mov rax, rsi
movzx ecx, byte ptr [rdi]
cmp ecx, 200
jg .LBB1_5
test ecx, ecx
jne .LBB1_3
lea rsi, [rip + .L__unnamed_2]
mov edx, 1
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.LBB1_5:
cmp ecx, 201
jne .LBB1_6
lea rsi, [rip + .L__unnamed_3]
mov edx, 3
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.LBB1_3:
lea rsi, [rip + .L__unnamed_4]
mov edx, 2
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.LBB1_6:
lea rsi, [rip + .L__unnamed_5]
mov edx, 4
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
This is bad not only because it generates a ton of code, which leads to the binary bloat that is evident from the example I started out with. The chain of comparisons also means that the Debug impl effectively scales with the number of variants in the enum, which is not particularly great.
# Perfect Hashing to the rescue?
Let's rewind a bit and take another look at what our enum looks like:

- It is data-less with an explicit #[repr(u32)].
- It has explicitly assigned discriminants which are sparse / non-contiguous.
- Plus: it has a finite / exhaustive number of discriminants.
- And we will only ever use valid discriminants in our Debug impl.
With these assumptions in place, we are searching for something that can map a finite
number of discriminants to a table of values.
The data-structure that can do this is a HashMap
of course.
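For illustration, the straightforward runtime version of that idea could look like the sketch below: a std HashMap from discriminant to variant name, built lazily on first use. It solves the lookup problem, but pays for initialization and hashing at runtime.

use std::collections::HashMap;
use std::sync::LazyLock;

// Runtime map from discriminant to variant name (sketch only, not where we want to end up).
static NAMES: LazyLock<HashMap<u32, &'static str>> = LazyLock::new(|| {
    HashMap::from([(0, "A"), (200, "BB"), (201, "CCC"), (202, "ABCD")])
});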
And since all the discriminants and values are known at compile time, we can use a perfect hash table (PHT) to encode everything statically in the binary.
I remember reading a very good article about how perfect hashing works in theory, but I can't seem to find it right now, so I will have a go at explaining it myself.
Compared to a normal hash table, which always has some spare capacity and needs to account for hash collisions, a perfect hash table maps all valid keys to their values directly, using as little spare capacity as possible.
Such a hash function might work like this, simplified:
(input_value.wrapping_add(STARTING_OFFSET) ^ XOR_VALUE) % TABLE_SIZE
There are sophisticated algorithms that will find appropriate values for STARTING_OFFSET
and XOR_VALUE
while
minimizing TABLE_SIZE
.
But we just have 4
discriminants from our example above: [0, 200, 201, 202]
.
It should be possible to brute-force some values here that fit us just right.
I chose to allow for some slack space, because brute-forcing exact matches did not yield any hits in a reasonable time, whereas allowing some wasted space gave me an answer immediately.
Here is the very naive implementation:
use rand::prelude::*;

/// Checks that all computed table indices are distinct.
/// (Uses a u32 bitmask, so this assumes a table size of at most 32.)
fn validate_no_duplicate_idx(indices: &[u32]) -> bool {
    let mut indices_hit: u32 = 0;
    for idx in indices {
        indices_hit |= 1 << idx;
    }
    indices_hit.count_ones() == indices.len() as u32
}

#[derive(Debug)]
struct PhfResult<const N: usize> {
    start_value: u32,
    xor_value: u32,
    table_size: u32,
    table_indices: [u32; N],
}

fn generate_perfect_hash_values<const N: usize>(
    input_values: [u32; N],
    slack_space: usize,
) -> PhfResult<N> {
    let mut rng = rand::thread_rng();
    loop {
        // Guess random constants and check whether they map all inputs
        // to distinct slots in a table of size N + slack_space.
        let xor_value: u32 = rng.gen();
        let start_value: u32 = rng.gen();
        let table_size = (N + slack_space) as u32;
        let table = input_values
            .map(|input_value| (input_value.wrapping_add(start_value) ^ xor_value) % table_size);
        if validate_no_duplicate_idx(&table) {
            return PhfResult {
                start_value,
                xor_value,
                table_size,
                table_indices: table,
            };
        }
    }
}
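Invoking it could look roughly like this; the slack_space of 1 is my reconstruction, but it matches the table_size of 5 in the output below:

fn main() {
    // The four discriminants of our sparse enum, plus one slot of slack space.
    let result = generate_perfect_hash_values([0, 200, 201, 202], 1);
    println!("{result:#?}");
}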
Running it on our example above gave me the following output:
PhfResult {
    start_value: 1803446167,
    xor_value: 597238773,
    table_size: 5,
    table_indices: [
        4,
        3,
        2,
        1,
    ],
}
Using these values, I can then hand-craft a better Debug
impl, and validate that it works:
#[repr(u32)]
#[derive(Copy, Clone)]
pub enum Enum {
    A = 0,
    BB = 200,
    CCC,
    ABCD,
}

impl std::fmt::Debug for Enum {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        const TABLE: &[&str] = &["", "ABCD", "CCC", "BB", "A"];

        let discriminant = *self as u32;
        let index = (discriminant.wrapping_add(1803446167) ^ 597238773) % 5;
        f.write_str(TABLE[index as usize])
    }
}

#[test]
fn validate_debug_impl() {
    assert_eq!(format!("{:?}", Enum::A), "A");
    assert_eq!(format!("{:?}", Enum::BB), "BB");
    assert_eq!(format!("{:?}", Enum::CCC), "CCC");
    assert_eq!(format!("{:?}", Enum::ABCD), "ABCD");
}
But does it really compile to better code? Let's check again using the Compiler Explorer:
<example::Enum as core::fmt::Debug>::fmt:
mov rax, rsi
mov ecx, 1803446167
add ecx, dword ptr [rdi]
xor ecx, 597238773
imul rdx, rcx, 1717986919
shr rdx, 33
lea edx, [rdx + 4*rdx]
sub ecx, edx
shl rcx, 4
lea rdx, [rip + .L__unnamed_1]
mov rsi, qword ptr [rcx + rdx]
mov rdx, qword ptr [rcx + rdx + 8]
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.L__unnamed_3:
.L__unnamed_4:
.ascii "ABCD"
.L__unnamed_5:
.zero 3,67
.L__unnamed_6:
.zero 2,66
.L__unnamed_7:
.byte 65
.L__unnamed_1:
.quad .L__unnamed_3
.zero 8
.quad .L__unnamed_4
.asciz "\004\000\000\000\000\000\000"
.quad .L__unnamed_5
.asciz "\003\000\000\000\000\000\000"
.quad .L__unnamed_6
.asciz "\002\000\000\000\000\000\000"
.quad .L__unnamed_7
.asciz "\001\000\000\000\000\000\000"
Indeed, we again end up with code that has no branches, and indexes right into a static table. Success!
# Making it scale
The above was just a toy example to make a point. But can we use the same principle to solve the original issue?
Obviously, I wouldn’t use my horribly bad code to brute-force these constants,
and I also wouldn’t hand-craft the Debug
impl either.
As in many other cases, there is a crate for that!
It is conveniently called phf
, and there is also a companion crate called phf_codegen
.
The combination of both crates generates a HashMap-like type that allows arbitrary lookups.
In our case, however, we know statically that we will only ever look up keys that are guaranteed to be in the map. So I decided to instead build on phf_generator and phf_shared, which are the lower-level building blocks.
Using phf_generator::generate_hash
returns a HashState
consisting of a random key
, a couple of
displacements
, and a map
which encodes the order in which we have to output our original values,
just like the table
in my example above.
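A rough sketch of that generation step, assuming the current phf_generator API (the function and field names below are how I understand the crate; treat this as an illustration rather than a finished build script):

// Hash the discriminants once at build time, then reorder the variant names
// according to `state.map`, so that `names_table[slot]` holds the name that the
// perfect hash will later map its discriminant to.
fn generate(
    discriminants: &[u32],
    names: &[&'static str],
) -> (u64, Vec<(u32, u32)>, Vec<&'static str>) {
    let state = phf_generator::generate_hash(discriminants);
    let names_table: Vec<&'static str> = state.map.iter().map(|&i| names[i]).collect();
    (state.key, state.disps, names_table)
}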
We can then generate some code, and use phf_shared
to derive the index.
Here is the hand-crafted code feeding values generated by phf_generator
into phf_shared
:
impl std::fmt::Debug for Enum {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        const KEY: phf_shared::HashKey = 12913932095322966823;
        const DISPLACEMENTS: &[(u32, u32)] = &[(1, 0)];
        const TABLE: &[&str] = &["A", "BB", "ABCD", "CCC"];

        let discriminant = *self as u32;
        let hashes = phf_shared::hash(&discriminant, &KEY);
        let index = phf_shared::get_index(&hashes, DISPLACEMENTS, TABLE.len());
        f.write_str(TABLE[index as usize])
    }
}
I will spare you the raw assembly code. The load from TABLE is still the same as before; apart from that, we have a ton of inlined code related to using a proper hash function (Sip13/128), which adds quite a lot of instructions.
The big question that remains is to implement all of this for the minidump_common use-case and see whether it is actually better in terms of bloat and also compile times. But that is an exercise for another day.
Update: The results are in. Implementing this approach in minidump_common for the Debug and FromPrimitive implementations yielded a ~100K reduction (about 1%) in the minidump-stackwalk binary size. Not bad! Although I haven't had a look at compile times at all.
Nonetheless, I'm not convinced that a huge chunk of auto-generated code is the best solution to all this, and I believe this should be handled in the compiler itself. As luck would have it, a similar issue was raised just a couple of days before I wrote this post, though that issue describes a more general case where enums with a large number of variants exhibit poor codegen. The case I presented here is a bit more special, as it involves manually defined sparse discriminants with large gaps in their values.