Optimizing Rust Enum `Debug`-ing with Perfect Hashing
This weekend is the start of my week of chill-at-home-and-do-nothing vacation. Which, for a passionate software engineer, is the perfect time to do some open source work and dive into interesting topics outside of work.
The topic I will be looking at is optimizing code generation of Rust enums. This deep dive is motivated by a real-world issue in the Rust minidump(_common) crate.
The minidump_common
crate defines two gigantic C-style enums that look a little bit like this:
#[repr(u32)]
#[derive(Copy, Clone, PartialEq, Eq, Debug, FromPrimitive)]
pub enum WinErrorWindows {
    ERROR_SUCCESS = 0,
    ERROR_INVALID_FUNCTION = 1,
    ERROR_FILE_NOT_FOUND = 2,
    ERROR_PATH_NOT_FOUND = 3,
    // ... about ~2_000 more variants
}
It is a data-less enum with an explicit u32 discriminant type; the discriminant values are explicitly assigned, and they are non-contiguous, meaning they have gaps in them.
The problem with this enum is that the Debug
and FromPrimitive
derives create a ton of bloat.
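To get a feel for where that bloat comes from, here is roughly what the FromPrimitive derive has to expand to for an enum like this (a simplified sketch of my own, not the exact num_derive output):

// Simplified sketch, not the exact derive expansion.
impl num_traits::FromPrimitive for WinErrorWindows {
    fn from_u64(n: u64) -> Option<Self> {
        // One match arm per variant, so the function grows linearly with the enum.
        match n {
            0 => Some(WinErrorWindows::ERROR_SUCCESS),
            1 => Some(WinErrorWindows::ERROR_INVALID_FUNCTION),
            2 => Some(WinErrorWindows::ERROR_FILE_NOT_FOUND),
            3 => Some(WinErrorWindows::ERROR_PATH_NOT_FOUND),
            // ... about ~2_000 more arms
            _ => None,
        }
    }

    fn from_i64(n: i64) -> Option<Self> {
        Self::from_u64(n as u64)
    }
}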
How much, you ask? Well, according to cargo bloat, these two derives are among the top offenders when compiling symbolicator, and symbolicator is a huge crate:
File .text Size Crate Name
0.3% 0.5% 108.8KiB minidump_common <minidump_common::errors::windows::WinErrorWindows as core::fmt::Debug>::fmt
0.2% 0.3% 62.0KiB minidump_common <minidump_common::errors::windows::WinErrorWindows as num_traits::cast::FromPrimitive>::from_u64
Another indicator of this bloat is cargo llvm-lines
, which has the following to say about our enum:
Lines Copies Function name
----- ------ -------------
9852153 271193 (TOTAL)
73972 (0.8%, 1.5%) 6184 (2.3%, 3.4%) <&T as core::fmt::Debug>::fmt
2832 (0.0%, 41.3%) 1 (0.0%, 25.6%) <minidump_common::errors::windows::WinErrorWindows as core::fmt::Debug>::fmt
2830 (0.0%, 41.4%) 1 (0.0%, 25.6%) <minidump_common::errors::windows::WinErrorWindows as num_traits::cast::FromPrimitive>::from_u64
The cargo llvm-lines output does not flag it as such a huge offender, and at ~2000 lines, there are barely any improvements to be had, as that is fairly close to the number of variants this enum has.
# Debug-ing enum codegen
Let's break this whole problem down and start from the beginning with a dead-simple enum.
#[derive(Debug)]
pub enum Enum {
    A,
    B,
    C,
    D,
}

pub fn fmt(e: Enum) -> String {
    format!("{e:?}")
}
Nothing out of the ordinary, and we can use this example to illustrate a couple of things.
First, let's see what kind of Rust code is being generated on behalf of that #[derive(Debug)].
We can do that directly on the Rust Playground: just choose "Expand macros" from the "Tools" dropdown. This is what it looks like:
pub enum Enum { A, B, C, D, }
#[automatically_derived]
impl ::core::fmt::Debug for Enum {
    fn fmt(&self, f: &mut ::core::fmt::Formatter) -> ::core::fmt::Result {
        ::core::fmt::Formatter::write_str(f,
            match self {
                Enum::A => "A",
                Enum::B => "B",
                Enum::C => "C",
                Enum::D => "D",
            })
    }
}
Very straightforward, no surprises here.
When this is compiled into native code by LLVM, checking the output in the Compiler Explorer gave me my first positive shock, as this is its output:
<example::Enum as core::fmt::Debug>::fmt:
mov rax, rsi
movzx ecx, byte ptr [rdi]
lea rdx, [rip + .Lreltable.<example::Enum as core::fmt::Debug>::fmt]
movsxd rsi, dword ptr [rdx + 4*rcx]
add rsi, rdx
mov edx, 1
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.Lreltable.<example::Enum as core::fmt::Debug>::fmt:
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_4-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_5-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_6-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.L__unnamed_3:
.byte 65
.L__unnamed_4:
.byte 66
.L__unnamed_5:
.byte 67
.L__unnamed_6:
.byte 68
LLVM is smart enough to turn all of this into a bunch of labeled bytes representing our single-character enum variant names,
and a table that references them. As the compiler knows that all our variant names are just a single character,
it will hardcode the value 1
in there.
To be perfectly frank, the compiler could do even better in this case ;-) If all our names are just single characters, and we have dense discriminants, we wouldn't need a table at all; we could just index directly into a flat string, something like the sketch below.
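This is purely hypothetical on my part, not something the compiler does today:

// Hypothetical "flat string" encoding: every name is exactly one byte and the
// discriminants are dense, so the name is a one-byte slice at offset `discriminant`.
const NAMES: &str = "ABCD";

pub fn variant_name(e: Enum) -> &'static str {
    let i = e as usize;
    &NAMES[i..i + 1]
}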
But okay, so far I am already impressed. Let's build on this example and make the variant names variable-length:
#[derive(Debug)]
pub enum Enum {
    A,
    BB,
    CCC,
    ABCD,
}
And the assembly output is now this:
<example::Enum as core::fmt::Debug>::fmt:
mov rax, rsi
movzx ecx, byte ptr [rdi]
lea rdx, [rip + .Lswitch.table.<example::Enum as core::fmt::Debug>::fmt]
mov rdx, qword ptr [rdx + 8*rcx]
lea rdi, [rip + .Lreltable.<example::Enum as core::fmt::Debug>::fmt]
movsxd rsi, dword ptr [rdi + 4*rcx]
add rsi, rdi
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.Lswitch.table.<example::Enum as core::fmt::Debug>::fmt:
.quad 1
.quad 2
.quad 3
.quad 4
[…]
.Lreltable.<example::Enum as core::fmt::Debug>::fmt:
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_4-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_5-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_6-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
[…]
.L__unnamed_3:
.byte 65
.L__unnamed_4:
.zero 2,66
.L__unnamed_5:
.zero 3,67
.L__unnamed_6:
.ascii "ABCD"
As we can see, we now end up with two tables: one holding the lengths of the names, and the second, as before, holding pointers to the raw bytes. It is also interesting that the compiler picks different assembler directives for the string data depending on whether the letters repeat and how long the string is.
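In Rust terms, the two tables together are essentially one &'static str (a pointer plus a length) per variant, indexed by the discriminant; LLVM just splits that into two parallel arrays. Conceptually, it boils down to something like this sketch (not actual compiler output):

// One name per variant, indexed directly by the discriminant.
// The compiler stores the pointers and lengths in two separate tables,
// but the information is the same.
static NAMES: [&str; 4] = ["A", "BB", "CCC", "ABCD"];

pub fn variant_name(e: Enum) -> &'static str {
    NAMES[e as usize]
}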
Our next experiment is making this a sparse enum by explicitly assigning discriminant values.
#[derive(Debug)]
pub enum Enum {
    A = 0,
    BB = 4,
    CCC,
    ABCD,
}
Again, a positive surprise from LLVM, as it will just insert duplicated entries into the tables, trading a bit of wasted space for efficient code:
.Lswitch.table.<example::Enum as core::fmt::Debug>::fmt:
.quad 1
.quad 1
.quad 1
.quad 1
.quad 2
.quad 3
.quad 4
.Lreltable.<example::Enum as core::fmt::Debug>::fmt:
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_3-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_4-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_5-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
.long .L__unnamed_6-.Lreltable.<example::Enum as core::fmt::Debug>::fmt
This optimization only triggers up to a certain threshold of course. When I insert a gap of ~200 between the discriminants, the generated code gets vastly worse.
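Concretely, the enum for this experiment looks like this (it is the same sparse enum, with discriminants [0, 200, 201, 202], that the rest of this post keeps using):

// Same enum as before, with BB bumped to 200 to create a large gap.
#[derive(Debug)]
pub enum Enum {
    A = 0,
    BB = 200,
    CCC,
    ABCD,
}

With that gap in place, we end up with a chain of comparisons and conditional jumps: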
<example::Enum as core::fmt::Debug>::fmt:
mov rax, rsi
movzx ecx, byte ptr [rdi]
cmp ecx, 200
jg .LBB1_5
test ecx, ecx
jne .LBB1_3
lea rsi, [rip + .L__unnamed_2]
mov edx, 1
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.LBB1_5:
cmp ecx, 201
jne .LBB1_6
lea rsi, [rip + .L__unnamed_3]
mov edx, 3
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.LBB1_3:
lea rsi, [rip + .L__unnamed_4]
mov edx, 2
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.LBB1_6:
lea rsi, [rip + .L__unnamed_5]
mov edx, 4
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
This is bad not only because it generates a ton of code, which leads to the binary bloat that is evident from the example I started out with. The chain of comparisons also means that the Debug impl effectively scales with the number of variants in the enum, which is not particularly great.
# Perfect Hashing to the rescue?
Let's rewind a bit and take another look at what our enum looks like:

- It is data-less with an explicit #[repr(u32)].
- It has explicitly assigned discriminants which are sparse / non-contiguous.
- Plus: it has a finite / exhaustive number of discriminants.
- And we will only ever use valid discriminants in our Debug impl.
With these assumptions in place, we are searching for something that can map a finite
number of discriminants to a table of values.
The data-structure that can do this is a HashMap
of course.
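For illustration, the straightforward runtime version of that idea could look like the sketch below: a std HashMap from discriminant to variant name, built lazily on first use. It solves the lookup problem, but pays for initialization and hashing at runtime.

use std::collections::HashMap;
use std::sync::LazyLock;

// Runtime map from discriminant to variant name (sketch only, not where we want to end up).
static NAMES: LazyLock<HashMap<u32, &'static str>> = LazyLock::new(|| {
    HashMap::from([(0, "A"), (200, "BB"), (201, "CCC"), (202, "ABCD")])
});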
And since all the discriminants and values are known at compile time, we can use a perfect hash table (PHT) to encode everything statically in the binary.
I remember reading a very good article about how perfect hashing works in theory, but I can't seem to find it right now, so I will have a go at explaining it myself.
Compared to a normal hash table, which always has some spare capacity and needs to account for hash collisions, a perfect hash table maps all valid keys to their values directly, using as little spare capacity as possible.
Such a hash function might work like this, simplified:
(input_value.wrapping_add(STARTING_OFFSET) ^ XOR_VALUE) % TABLE_SIZE
There are sophisticated algorithms that will find appropriate values for STARTING_OFFSET
and XOR_VALUE
while
minimizing TABLE_SIZE
.
But we just have 4
discriminants from our example above: [0, 200, 201, 202]
.
It should be possible to brute-force some values here that fit us just right.
I chose to allow for some slack space, because brute-forcing exact matches did not yield any hits in a reasonable time, whereas allowing some wasted space gave me an answer immediately.
Here is the very naive implementation:
use rand::prelude::*;

/// Checks that all computed table indices are distinct.
/// (Uses a u32 bitmask, so this assumes a table size of at most 32.)
fn validate_no_duplicate_idx(indices: &[u32]) -> bool {
    let mut indices_hit: u32 = 0;
    for idx in indices {
        indices_hit |= 1 << idx;
    }
    indices_hit.count_ones() == indices.len() as u32
}

#[derive(Debug)]
struct PhfResult<const N: usize> {
    start_value: u32,
    xor_value: u32,
    table_size: u32,
    table_indices: [u32; N],
}

fn generate_perfect_hash_values<const N: usize>(
    input_values: [u32; N],
    slack_space: usize,
) -> PhfResult<N> {
    let mut rng = rand::thread_rng();
    loop {
        // Guess random constants and check whether they map all inputs
        // to distinct slots in a table of size N + slack_space.
        let xor_value: u32 = rng.gen();
        let start_value: u32 = rng.gen();
        let table_size = (N + slack_space) as u32;
        let table = input_values
            .map(|input_value| (input_value.wrapping_add(start_value) ^ xor_value) % table_size);
        if validate_no_duplicate_idx(&table) {
            return PhfResult {
                start_value,
                xor_value,
                table_size,
                table_indices: table,
            };
        }
    }
}
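Invoking it could look roughly like this; the slack_space of 1 is my reconstruction, but it matches the table_size of 5 in the output below:

fn main() {
    // The four discriminants of our sparse enum, plus one slot of slack space.
    let result = generate_perfect_hash_values([0, 200, 201, 202], 1);
    println!("{result:#?}");
}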
Running it on our example above gave me the following output:
PhfResult {
    start_value: 1803446167,
    xor_value: 597238773,
    table_size: 5,
    table_indices: [
        4,
        3,
        2,
        1,
    ],
}
Using these values, I can then hand-craft a better Debug
impl, and validate that it works:
#[repr(u32)]
#[derive(Copy, Clone)]
pub enum Enum {
    A = 0,
    BB = 200,
    CCC,
    ABCD,
}

impl std::fmt::Debug for Enum {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        const TABLE: &[&str] = &["", "ABCD", "CCC", "BB", "A"];

        let discriminant = *self as u32;
        let index = (discriminant.wrapping_add(1803446167) ^ 597238773) % 5;
        f.write_str(TABLE[index as usize])
    }
}

#[test]
fn validate_debug_impl() {
    assert_eq!(format!("{:?}", Enum::A), "A");
    assert_eq!(format!("{:?}", Enum::BB), "BB");
    assert_eq!(format!("{:?}", Enum::CCC), "CCC");
    assert_eq!(format!("{:?}", Enum::ABCD), "ABCD");
}
But does it really compile to better code? Let's check again using the Compiler Explorer:
<example::Enum as core::fmt::Debug>::fmt:
mov rax, rsi
mov ecx, 1803446167
add ecx, dword ptr [rdi]
xor ecx, 597238773
imul rdx, rcx, 1717986919
shr rdx, 33
lea edx, [rdx + 4*rdx]
sub ecx, edx
shl rcx, 4
lea rdx, [rip + .L__unnamed_1]
mov rsi, qword ptr [rcx + rdx]
mov rdx, qword ptr [rcx + rdx + 8]
mov rdi, rax
jmp qword ptr [rip + _ZN4core3fmt9Formatter9write_str17hdb374abbd294d87eE@GOTPCREL]
.L__unnamed_3:
.L__unnamed_4:
.ascii "ABCD"
.L__unnamed_5:
.zero 3,67
.L__unnamed_6:
.zero 2,66
.L__unnamed_7:
.byte 65
.L__unnamed_1:
.quad .L__unnamed_3
.zero 8
.quad .L__unnamed_4
.asciz "\004\000\000\000\000\000\000"
.quad .L__unnamed_5
.asciz "\003\000\000\000\000\000\000"
.quad .L__unnamed_6
.asciz "\002\000\000\000\000\000\000"
.quad .L__unnamed_7
.asciz "\001\000\000\000\000\000\000"
Indeed, we again end up with code that has no branches, and indexes right into a static table. Success!
# Making it scale
The above was just a toy example to make a point. But can we use the same principle to solve the original issue?
Obviously, I wouldn’t use my horribly bad code to brute-force these constants,
and I also wouldn’t hand-craft the Debug
impl either.
As in many other cases, there is a crate for that!
It is conveniently called phf
, and there is also a companion crate called phf_codegen
.
The combination of both crates generates a HashMap-like type that allows arbitrary lookups.
In our case, however, we know statically that we will only ever look up keys that are guaranteed to be in the map. So I decided to instead build on phf_generator and phf_shared, which are the lower-level building blocks.
Using phf_generator::generate_hash
returns a HashState
consisting of a random key
, a couple of
displacements
, and a map
which encodes the order in which we have to output our original values,
just like the table
in my example above.
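A rough sketch of that generation step, assuming the current phf_generator API (the function and field names below are how I understand the crate; treat this as an illustration rather than a finished build script):

// Hash the discriminants once at build time, then reorder the variant names
// according to `state.map`, so that `names_table[slot]` holds the name that the
// perfect hash will later map its discriminant to.
fn generate(
    discriminants: &[u32],
    names: &[&'static str],
) -> (u64, Vec<(u32, u32)>, Vec<&'static str>) {
    let state = phf_generator::generate_hash(discriminants);
    let names_table: Vec<&'static str> = state.map.iter().map(|&i| names[i]).collect();
    (state.key, state.disps, names_table)
}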
We can then generate some code, and use phf_shared
to derive the index.
Here is the hand-crafted code feeding values generated by phf_generator
into phf_shared
:
impl std::fmt::Debug for Enum {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        const KEY: phf_shared::HashKey = 12913932095322966823;
        const DISPLACEMENTS: &[(u32, u32)] = &[(1, 0)];
        const TABLE: &[&str] = &["A", "BB", "ABCD", "CCC"];

        let discriminant = *self as u32;
        let hashes = phf_shared::hash(&discriminant, &KEY);
        let index = phf_shared::get_index(&hashes, DISPLACEMENTS, TABLE.len());
        f.write_str(TABLE[index as usize])
    }
}
I will spare you the raw assembly code. The load from TABLE is still the same as before; apart from that, we have a ton of inlined code related to using a proper hash function (Sip13/128), which adds quite a lot of instructions.
The big question that remains is to implement all of this for the minidump_common use-case and see whether it is actually better in terms of bloat and also compile times. But that is an exercise for another day.
Update: The results are in. Implementing this approach in minidump_common for the Debug and FromPrimitive implementations yielded a ~100K reduction (about 1%) in the minidump-stackwalk binary size. Not bad! Although I haven't had a look at compile times at all.
Nonetheless, I'm not convinced that a huge chunk of auto-generated code is the best solution to all this, and I believe this should be handled in the compiler itself. As luck would have it, a similar issue was raised just a couple of days before I wrote this post, though that issue describes a more general case where enums with a large number of variants exhibit poor codegen. The case I presented here is a bit more special, as it involves manually defined sparse discriminants with large gaps in their values.