
The magic of scope guards

— 7 min

Scope guards in Rust are awesome!

Apart from explaining why, I also want to explore one specific side of them that I have never read about directly: their effect on compile times.

# A small RA(II)nt

Let me start today's exploration with a bit of a rant. What I am talking about today, and what I refer to as “scope guards”, is often called the RAII pattern. That stands for “resource acquisition is initialization”, and I believe it's a horrible acronym. What are we initializing? Are we even acquiring anything? Well, maybe when we are talking about locks, yes, but otherwise?

Apart from that, I am also very much against computer-science-speak. More specifically, hiding otherwise easy-to-understand concepts behind complicated-sounding nomenclature. In computer-science-speak, this concept is called “affine types”. What the hell does “affine” even mean? I am not a native speaker, but according to a dictionary, it also translates to “affin” in my native German. Well, thanks for nothing. Another translation is “verwandt”, which means “related” in English. Okay, that does not help either.

These “affine types” are also related (see what I did there?) to “linear types” that are being discussed in the Rust community right now. Another word that on its own does not convey any meaning.


Putting these concepts into words that anyone should be able to understand: a scope guard, RAII type, or affine type is a type that has a destructor (a piece of code) which is called automatically at the end of its scope. Hence I call them scope guards, as I believe that describes their use-case best. If I understand the whole linear type debate correctly, the problem is that the scope can in theory be extended to infinity by leaking the type, which is especially bad for types that require their destructor to run for soundness.
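
To put that into code, here is a minimal sketch of such a guard, with made-up names for illustration:

struct ConnectionGuard;

impl Drop for ConnectionGuard {
    fn drop(&mut self) {
        // this runs automatically when the guard goes out of scope
        println!("cleaning up");
    }
}

fn main() {
    {
        let _guard = ConnectionGuard;
        println!("doing work");
    } // <- `drop` runs here, at the end of the scope

    // leaking is safe Rust, and it means the destructor never runs:
    std::mem::forget(ConnectionGuard);
}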

# Scopes

This brings us to the classic example that exhibited one of these soundness problems: Scoped Threads. I do not want to go into the details here, as I bet I would get half of that wrong.
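
For completeness, this is roughly what the closure-based solution looks like in today's standard library, using std::thread::scope; all threads spawned inside the scope are guaranteed to have finished before the call returns:

fn main() {
    let mut data = vec![1, 2, 3];

    std::thread::scope(|s| {
        s.spawn(|| {
            // borrowing `data` is sound: the thread is joined before `scope` returns
            println!("borrowed from the scoped thread: {data:?}");
        });
    });

    // all scoped threads have finished here, so we can mutate again
    data.push(4);
    println!("{data:?}");
}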

What I do want to highlight is the usage of closures, specifically a FnOnce, which gives stronger guarantees that destructors actually run than plain scope guards do. The function that executes the closure will only return to its caller once all the necessary cleanup is done. But being a (generic) function comes with two major downsides.

One is the function-coloring problem that makes it not play well with async code. And the other one is that it is generic and will thus be monomorphized by the compiler. The compiler will compile the outer function multiple times, in the worst case once for every call site.

An interesting observation on the side is that you can always trivially move from a scope-guard version of code to a closure version:

fn takes_closure<O, F: FnOnce() -> O>(f: F) -> O {
    // `create_guard` stands in for whatever constructs the scope guard
    let _guard = create_guard();
    f()
} // the guard is dropped here, after `f` has returned

Sometimes you want the compiler to duplicate and inline all the code; sometimes inlining will give better runtime performance. But depending on how large the code is, outlining might be the better idea. Outlined code might introduce more jumps and another stack frame, but depending on how hot the actual code is, it might be better for the instruction cache.
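
Rust lets you nudge the compiler in either direction with inline attributes; a tiny sketch of what I mean (whether either one is a win depends entirely on the workload):

// a small, hot helper that is a good candidate for inlining everywhere:
#[inline(always)]
fn hot_helper(x: u32) -> u32 {
    x.wrapping_mul(31) ^ 7
}

// a large, rarely executed function that is better kept as a single outlined copy:
#[inline(never)]
fn cold_cleanup() {
    println!("a lot of rarely-executed cleanup code lives here");
}

fn main() {
    println!("{}", hot_helper(42));
    cold_cleanup();
}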

# Compile times

But today I want to specifically focus on compile times.

For this, I first created a chunk of large and slow-to-compile code:

macro_rules! blow_up {
    // base case: a single identifier turns into one `println!`
    ($a:ident) => {
        println!("hello {}!", stringify!($a));
    };

    // recursive case: the rest is expanded twice, which makes the output exponential
    ($a:ident $($rest:tt)+) => {
        blow_up!($a);
        blow_up!($($rest)+);
        blow_up!($($rest)+);
    }
}

macro_rules! make_slow {
    () => {
        blow_up!(
            a0 b0 c0 d0 e0 f0 g0 h0 i0 j0
        );
    }
}

This code intentionally generates an exponential number of println! statements: each additional identifier roughly doubles the output, so the 10 identifiers above expand to 2^10 - 1 = 1023 println! calls. That makes it slow to compile, and it compiles to a ton of code, so we have something to measure.

Going with the closure-based code first, we want to put this code both before and after our actual closure call, like this:

fn takes_closure<O, F: FnOnce() -> O>(f: F) -> O {
    make_slow!();
    let o = f();
    make_slow!();
    o
}

And in the end we invoke the closure a couple of times with different types, to be extra sure the compiler will compile it multiple times:

fn main() {
    let a = takes_closure(|| 1u8);
    print!("{a}");
    let a = takes_closure(|| 1u16);
    print!("{a}");
    let a = takes_closure(|| 1u32);
    print!("{a}");
    let a = takes_closure(|| 1u64);
    print!("{a}");

    println!();
}

On my system, compiling this code in debug mode with -Z time-passes takes a bit over 2 seconds, and highlights a couple of slow parts of compilation:

time:   0.368; rss:   55MB ->   91MB (  +36MB)  MIR_borrow_checking
time:   1.108; rss:  122MB ->   46MB (  -77MB)  LLVM_passes
time:   1.276; rss:   65MB ->   46MB (  -19MB)  link
time:   2.383; rss:   10MB ->   39MB (  +29MB)  total
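
For reference, I am producing this output with something along these lines (the -Z flag needs a nightly toolchain):

cargo +nightly rustc -- -Z time-passes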

Doing a --release build increases the timings, roughly doubling them:

time:   3.427; rss:  118MB ->   46MB (  -72MB)  LLVM_passes
time:   3.631; rss:   54MB ->   46MB (   -8MB)  link
time:   4.628; rss:   10MB ->   41MB (  +31MB)  total

Looking at the cargo llvm-lines output reveals that we have 4 copies of the same function, as expected:

  Lines                Copies            Function name
  -----                ------            -------------
  98640                25                (TOTAL)
  98292 (99.6%, 99.6%)  4 (16.0%, 16.0%) guards_closures::takes_closure
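
cargo llvm-lines is a separate cargo subcommand; if you want to follow along, installing and running it should look roughly like this:

cargo install cargo-llvm-lines
cargo llvm-lines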

LLVM is rightly slow, as it has a ton to compile. Can we do better on that front, by moving all that code to the scope guard pattern, at the same time making it compatible with async code?

Let's see. First up, we need our guard type:

struct Guard;

impl Guard {
    pub fn new() -> Self {
        make_slow!();
        Self
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        make_slow!();
    }
}

There is no generic code here anymore, which is exactly what we wanted to achieve. We can then manually create some scopes, create the guard type and have its destructor automatically called at the end:


fn main() {
    {
        let _guard = Guard::new();
        print!("{}", 1u8);
    }
    {
        let _guard = Guard::new();
        print!("{}", 1u16);
    }
    {
        let _guard = Guard::new();
        print!("{}", 1u32);
    }
    {
        let _guard = Guard::new();
        print!("{}", 1u64);
    }

    println!();
}
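
And because there is no closure involved, the exact same pattern drops straight into async code. A minimal sketch, where do_async_work is a made-up stand-in and you still need some executor to actually run it:

async fn with_guard() -> u32 {
    let _guard = Guard::new();
    // the guard lives across the await point and is dropped
    // when the future completes (or when the future itself is dropped)
    do_async_work().await
}

async fn do_async_work() -> u32 {
    42
}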

How does it do in terms of compile times and cargo llvm-lines now?

time:   0.288; rss:   50MB ->   82MB (  +32MB)  MIR_borrow_checking
time:   0.072; rss:   95MB ->   43MB (  -52MB)  LLVM_passes
time:   0.246; rss:   62MB ->   44MB (  -18MB)  link
time:   1.069; rss:   10MB ->   37MB (  +28MB)  total

The MIR borrow checking time might as well just be some noise, but the LLVM time is a lot faster.

Here is the same for a --release build:

time:   1.066; rss:   93MB ->   45MB (  -48MB)  LLVM_passes
time:   1.272; rss:   44MB ->   45MB (   +1MB)  link
time:   2.028; rss:   10MB ->   40MB (  +30MB)  total

That is a bit more than twice as fast to compile as the closure-based version. Let's check the llvm-lines output:

  Lines                Copies            Function name
  -----                ------            -------------
  24917                20                (TOTAL)
  12280 (49.3%, 49.3%)  1 (5.0%,  5.0%)  <guards_guards::Guard as core::ops::drop::Drop>::drop
  12277 (49.3%, 98.6%)  1 (5.0%, 10.0%)  guards_guards::Guard::new

As expected, we only have a single copy of the expensive constructor and destructor, and about a quarter of the code to compile compared to before.

How does this affect the final binary size?

Looking at the Windows .exe (without the .pdb) I end up with the following matrix:

| Pattern  | Debug | Release |
| -------- | ----- | ------- |
| closures | 630K  | 247K    |
| guards   | 274K  | 200K    |

That is a big difference indeed, both in compile times, and in the size of the compiled executable.


I will end today's exploration on that note. In the real world, I have an example of a way-too-generic crate that I suspect massively slows down compile times, and I would like to explore moving it to a scope-guard code style.

This blog post explores that idea in a “lab setting”. I do not yet know if the same improvements could be had with some real world code as well. Not to mention that actually migrating the codebase might be a huge effort on its own.