Towards fast `thread_local!` context

I have already blogged about the surprisingly slow Rust `thread_local!`s, and today I ran into another related problem. And as I often do, I went right down a rabbit hole digging into it.
To give a bit of context, I am currently working on a metrics crate. One of my main goals is to make it as high performance and as low overhead as possible.
Another goal is to make it simple to use. That means having a global fire-and-forget API.
Just slap a `counter!("counter_name": 1)` anywhere and it should just work. The emitted metrics should make their way to a globally defined collector / sink.
This is very similar to how the tracing ecosystem works, and I took a lot of inspiration from it, both in terms of API and implementation.
If you take a look at the tracing::dispatcher docs, you will find a couple of global functions:

- `set_global_default` to set a single global dispatcher.
- `set_default` to set a thread-local dispatcher and return a scope guard which unsets it.
- `get_default` to access the current dispatcher, preferring the thread-local one, and falling back to the global one.
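As a rough sketch of how these three functions fit together, here is a simplified stand-in built only on std. This is not tracing's actual implementation; the `Dispatch` type and all bodies are illustrative:

```rust
use std::cell::RefCell;
use std::sync::OnceLock;

// Stand-in for tracing's Dispatch type.
#[derive(Debug)]
struct Dispatch(&'static str);

// The single global default, set at most once.
static GLOBAL: OnceLock<Dispatch> = OnceLock::new();

thread_local! {
    // The per-thread override, unset by default.
    static LOCAL: RefCell<Option<Dispatch>> = const { RefCell::new(None) };
}

fn set_global_default(d: Dispatch) -> Result<(), Dispatch> {
    GLOBAL.set(d)
}

// Guard that unsets the thread-local dispatcher when dropped.
struct DefaultGuard;

impl Drop for DefaultGuard {
    fn drop(&mut self) {
        LOCAL.with_borrow_mut(|local| *local = None);
    }
}

fn set_default(d: Dispatch) -> DefaultGuard {
    LOCAL.with_borrow_mut(|local| *local = Some(d));
    DefaultGuard
}

fn get_default<T>(f: impl FnOnce(&Dispatch) -> T) -> T {
    LOCAL.with_borrow(|local| match local {
        // Prefer the thread-local dispatcher if one is set...
        Some(d) => f(d),
        // ...and fall back to the global one otherwise.
        None => f(GLOBAL.get().unwrap_or(&Dispatch("none"))),
    })
}

fn main() {
    set_global_default(Dispatch("global")).unwrap();
    assert_eq!(get_default(|d| d.0), "global");
    {
        let _guard = set_default(Dispatch("local"));
        assert_eq!(get_default(|d| d.0), "local");
    } // guard dropped, thread-local unset again
    assert_eq!(get_default(|d| d.0), "global");
}
```

Note how `get_default` has to touch the thread-local on every single call, which is exactly where the performance question comes in.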
This is quite a common pattern and is not limited to metrics or tracing at all.
It was widely discussed in the Rust community over two years ago as Contexts and capabilities.
I also chimed in with my own blog post back then.
All the hype, however, was short-lived, and I have not heard anything about it since.
But as we will see, this whole problem is still alive and well, as "just use `thread_local!`" does not really cut it.
Let's get back to code, and look at the implementation of get_default in tracing.
We can immediately see something surprising:
```rust
if SCOPED_COUNT.load(Ordering::Acquire) == 0 {
    // fast path if no scoped dispatcher has been set; just use the global
    // default.
    return f(get_global());
}
```
As the comment suggests, accessing thread locals is so slow that it's beneficial to avoid them at all cost, for example by using a global `AtomicUsize` to keep track of the number of thread-local overrides currently set.
The assumption here is that in most cases, you will only have a single global context, or maybe even none at all. Using thread-local overrides is rather rare, mostly used within tests where you want to have isolation.
This is an interesting hypothesis, so let's try to prove it with some benchmarking.
So, instead of a local fallback version that does:

- check the `thread_local` if it is usable,
- fall back to the `global` otherwise,

it might be faster to use a global override version which does this instead:

- if any `thread_local` is defined:
  - check our own `thread_local` if it is usable,
- fall back to the `global` otherwise.
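A minimal sketch of the two variants, using a `Cell<usize>` with `0` standing in for "unset" (the names, the guard type, and the sentinel encoding are all made up for illustration):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

static GLOBAL: usize = 42;
// Counts how many thread-local overrides exist across ALL threads.
static SCOPED_COUNT: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // 0 means "no override set on this thread".
    static OVERRIDE: Cell<usize> = const { Cell::new(0) };
}

// Variant 1: local fallback. Always touches the thread-local first.
fn get_local_fallback() -> usize {
    match OVERRIDE.get() {
        0 => GLOBAL,
        v => v,
    }
}

// Variant 2: global override. Checks the atomic first, and only
// touches the thread-local if an override exists anywhere at all.
fn get_global_override() -> usize {
    if SCOPED_COUNT.load(Ordering::Acquire) == 0 {
        // Fast path: no override on any thread, use the global.
        return GLOBAL;
    }
    match OVERRIDE.get() {
        0 => GLOBAL,
        v => v,
    }
}

// Scope guard that sets an override and maintains the counter.
struct Guard;

fn set_override(v: usize) -> Guard {
    OVERRIDE.set(v);
    SCOPED_COUNT.fetch_add(1, Ordering::Release);
    Guard
}

impl Drop for Guard {
    fn drop(&mut self) {
        OVERRIDE.set(0);
        SCOPED_COUNT.fetch_sub(1, Ordering::Release);
    }
}

fn main() {
    assert_eq!(get_global_override(), 42);
    let guard = set_override(7);
    assert_eq!(get_global_override(), 7);
    assert_eq!(get_local_fallback(), 7);
    drop(guard);
    assert_eq!(get_global_override(), 42);
}
```

The trade-off is visible right in the code: variant 2 trades one atomic load on every read for the chance to skip the thread-local access entirely.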
My intuition here is that the fast path should prove worthwhile, but it might well turn out to be worse in case we actually have a thread-local override set.
So I expect things to be ranked performance-wise like this:
- using global with fast path (no thread local set)
- using a thread local value (thread local is set)
- using global without fast path (no thread local set)
- using a thread local despite the fast path (thread local is set)
# Benchmarks
Benchmarking this however turns out to be a bit difficult, in particular separating the setup/teardown code from the actual benchmark. In this specific case, that means separating setting / unsetting the thread-local value from actually using it.
I use divan for benchmarks, primarily because it has a convenient API, and because it makes it trivial to run the benchmark on multiple threads just by using the DIVAN_THREADS env variable. I actually use a fork which avoids spawning thousands of threads, and instead reuses just as many threads as were requested.
While in theory divan allows separating the setup and teardown code by using with_inputs for setup, and returning a type to be Drop-ed later, the way this works is fundamentally incompatible with scope guards: it first initializes all the inputs, then runs the benchmark, and only Drops all the outputs afterwards.
So all of the results for the benchmarks with the thread local set also include the code for setting and unsetting the thread_local. I tried to account for that by running multiple reads per benchmark iteration, so there is a higher read-to-write ratio.
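The amortization idea, shown in isolation (the override value and the read count are arbitrary; this is not the actual benchmark code):

```rust
use std::cell::Cell;
use std::hint::black_box;

thread_local! {
    static OVERRIDE: Cell<usize> = const { Cell::new(0) };
}

const READS_PER_ITER: usize = 100;

// One benchmark iteration: a single set/unset pair amortized over many
// reads, so the guard overhead contributes less to the measured time.
fn bench_iteration() -> usize {
    OVERRIDE.set(7); // "setup" that divan cannot separate out for us
    let mut sum = 0;
    for _ in 0..READS_PER_ITER {
        // black_box keeps the compiler from folding the loop away.
        sum += black_box(OVERRIDE.get());
    }
    OVERRIDE.set(0); // "teardown"
    sum
}

fn main() {
    assert_eq!(bench_iteration(), 700);
}
```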
I won’t be copy-pasting the code in this blog post, but curious folks can find all the code here.
Here are the first results:
```
locals                          fastest  │ slowest  │ median   │ mean
├─ _1_stable_copy                        │          │          │
│  ├─ _1_global_override_unset  984.2 ns │ 86.33 µs │ 1.014 µs │ 1.034 µs
│  ├─ _2_local_fallback_set     1.074 µs │ 19.36 µs │ 1.095 µs │ 1.124 µs
│  ├─ _3_local_fallback_unset   1.155 µs │ 20.12 µs │ 1.176 µs │ 1.209 µs
│  ╰─ _4_global_override_set    1.196 µs │ 20.78 µs │ 1.236 µs │ 1.263 µs
```
The benchmark results do prove my initial intuition, though only by a small-ish margin. I am still far from being able to achieve stable benchmark timings, and I see quite some variance between runs.
Here, the global override solution that checks a global flag whether any thread_local override exists is the fastest, and checking the thread_local first (local fallback) comes in second. However, if one actually intends to use the thread_local overrides, using only a thread_local! for that is a bit faster than additionally maintaining a global atomic counter, no surprise there.
Now, as I mentioned, one of the selling points of divan is the ability to trivially run the benchmarks on multiple threads, so let's do just that via DIVAN_THREADS=0,1:
```
locals                          fastest  │ slowest  │ median   │ mean
├─ _1_stable_copy                        │          │          │
│  ├─ _1_global_override_unset           │          │          │
│  │  ├─ t=1                    978.7 ns │ 34.87 µs │ 994.2 ns │ 1.016 µs
│  │  ╰─ t=16                   994.2 ns │ 7 µs     │ 1.054 µs │ 1.081 µs
│  ├─ _2_local_fallback_set              │          │          │
│  │  ├─ t=1                    1.06 µs  │ 25.58 µs │ 1.11 µs  │ 1.152 µs
│  │  ╰─ t=16                   1.145 µs │ 11.78 µs │ 1.245 µs │ 1.243 µs
│  ├─ _3_local_fallback_unset            │          │          │
│  │  ├─ t=1                    1.18 µs  │ 25.1 µs  │ 1.2 µs   │ 1.237 µs
│  │  ╰─ t=16                   1.206 µs │ 6.94 µs  │ 1.276 µs │ 1.292 µs
│  ╰─ _4_global_override_set             │          │          │
│     ├─ t=1                    1.201 µs │ 24.78 µs │ 1.256 µs │ 1.293 µs
│     ╰─ t=16                   1.316 µs │ 12.48 µs │ 1.417 µs │ 1.447 µs
```
Surprisingly, running things on multiple threads does not really make such a big difference, even with 16 threads while my CPU only has 8 physical cores. I would have expected things to slow down quite significantly, especially in the worst-case scenario when multiple threads are contending on the single global “does any thread_local exist” counter.
So things aren’t even as bad as I made them out to be? So where is the problem?
Well, anyone who clicked the link to the code above might have seen that this first round of benchmarks was using a trivial Cell&lt;usize&gt;. Things get quite a bit more complicated, both in terms of code to write as well as in performance characteristics, when dealing with more complex types which are not Copy and have a custom Drop implementation.
So let's do the same thing, but this time with a RefCell&lt;Option&lt;String&gt;&gt;. We cannot just get() the value anymore; we have to use with_borrow and nest all the code in a closure because of lifetimes.
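For example, a small sketch of that access pattern (the context type and the fallback value are made up):

```rust
use std::cell::RefCell;

thread_local! {
    static CONTEXT: RefCell<Option<String>> = const { RefCell::new(None) };
}

// All code that wants to look at the value has to be nested inside the
// closure passed to `with_borrow`, since the borrow ends when it returns.
fn with_context<T>(f: impl FnOnce(&str) -> T) -> T {
    CONTEXT.with_borrow(|ctx| match ctx.as_deref() {
        Some(local) => f(local),
        None => f("global"), // fall back when no override is set
    })
}

fn main() {
    assert_eq!(with_context(str::len), 6); // falls back to "global"
    CONTEXT.with_borrow_mut(|ctx| *ctx = Some("local".to_string()));
    assert_eq!(with_context(str::to_owned), "local");
}
```

On top of the closure gymnastics, every access now also pays for the RefCell borrow-flag bookkeeping.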
How are things looking with a RefCell&lt;Option&lt;String&gt;&gt;?
```
locals                          fastest  │ slowest  │ median   │ mean
├─ _2_stable_drop                        │          │          │
│  ├─ _1_global_override_unset           │          │          │
│  │  ├─ t=1                    1.452 µs │ 18.88 µs │ 1.493 µs │ 1.522 µs
│  │  ╰─ t=16                   1.568 µs │ 13.02 µs │ 1.688 µs │ 1.708 µs
│  ├─ _2_local_fallback_set              │          │          │
│  │  ├─ t=1                    3.893 µs │ 50.76 µs │ 3.994 µs │ 4.087 µs
│  │  ╰─ t=16                   4.342 µs │ 17.37 µs │ 4.635 µs │ 4.724 µs
│  ├─ _3_local_fallback_unset            │          │          │
│  │  ├─ t=1                    3.808 µs │ 49.25 µs │ 3.929 µs │ 4.014 µs
│  │  ╰─ t=16                   3.99 µs  │ 19.42 µs │ 4.231 µs │ 4.331 µs
│  ╰─ _4_global_override_set             │          │          │
│     ├─ t=1                    4.02 µs  │ 51.16 µs │ 4.102 µs │ 4.238 µs
│     ╰─ t=16                   4.425 µs │ 32.14 µs │ 4.677 µs │ 4.907 µs
```
Now there is a much bigger win from our optimization, more than 2x actually.
# Nightly
So far, all the benchmarks were using the thread_local! macro which is available on stable Rust. There is also the nightly-only #[thread_local] attribute that can be used with a static mut, which also means that one has to use unsafe for any access to the thread-local. As a side note, the tracking issue for the #[thread_local] attribute was opened in 2015, so I'm doubtful it will make progress, and we might not even need it after all.
I tried that as well, using a straight-up Option&lt;String&gt; to avoid any RefCell overhead. In this case we are only creating shared references, so it is not quite as easy to cause major breakage as it would be with a mutable reference. But as we have to use unsafe, all the other restrictions still apply.
Using the nightly #[thread_local] attribute gives these results for trivial and complex types:
```
locals                          fastest  │ slowest  │ median   │ mean
├─ _3_nightly_copy                       │          │          │
│  ├─ _1_global_override_unset           │          │          │
│  │  ├─ t=1                    1.133 µs │ 33.37 µs │ 1.164 µs │ 1.185 µs
│  │  ╰─ t=16                   1.179 µs │ 9.845 µs │ 1.219 µs │ 1.25 µs
│  ├─ _2_local_fallback_set              │          │          │
│  │  ├─ t=1                    1.29 µs  │ 22 µs    │ 1.316 µs │ 1.351 µs
│  │  ╰─ t=16                   1.331 µs │ 7.751 µs │ 1.386 µs │ 1.425 µs
│  ├─ _3_local_fallback_unset            │          │          │
│  │  ├─ t=1                    1.391 µs │ 19.19 µs │ 1.441 µs │ 1.463 µs
│  │  ╰─ t=16                   1.481 µs │ 13.83 µs │ 1.552 µs │ 1.605 µs
│  ╰─ _4_global_override_set             │          │          │
│     ├─ t=1                    1.441 µs │ 23.64 µs │ 1.467 µs │ 1.498 µs
│     ╰─ t=16                   1.523 µs │ 42.82 µs │ 1.613 µs │ 1.675 µs
╰─ _4_nightly_drop                       │          │          │
   ├─ _1_global_override_unset           │          │          │
   │  ├─ t=1                    1.467 µs │ 74.43 µs │ 1.498 µs │ 1.538 µs
   │  ╰─ t=16                   1.573 µs │ 27.9 µs  │ 1.654 µs │ 1.68 µs
   ├─ _2_local_fallback_set              │          │          │
   │  ├─ t=1                    1.886 µs │ 56.67 µs │ 1.957 µs │ 1.991 µs
   │  ╰─ t=16                   2.25 µs  │ 58.89 µs │ 2.542 µs │ 2.598 µs
   ├─ _3_local_fallback_unset            │          │          │
   │  ├─ t=1                    1.806 µs │ 27.15 µs │ 1.871 µs │ 1.901 µs
   │  ╰─ t=16                   1.967 µs │ 40.45 µs │ 2.047 µs │ 2.129 µs
   ╰─ _4_global_override_set             │          │          │
      ├─ t=1                    2.047 µs │ 53.8 µs  │ 2.069 µs │ 2.117 µs
      ╰─ t=16                   2.321 µs │ 36.43 µs │ 2.594 µs │ 2.706 µs
```
Interestingly, this turns out to be worse for trivial types, but ends up being a win for more complex types. I am not sure whether the win comes from avoiding the RefCell though; that remains an exercise for the reader.