Towards fast `thread_local!` context

I have already blogged about the surprisingly slow Rust `thread_local!`s, and today I ran into another related problem. And as I often do, I went right down a rabbit hole digging into it.
To give a bit of context, I am currently working on a metrics crate. One of my main goals is to make it as high performance and as low overhead as possible.
Another goal is to make it simple to use. That means having a global fire-and-forget API.
Just slap a `counter!("counter_name": 1)` anywhere and it should just work. The emitted metrics should make their way to a globally defined collector / sink.
This is very similar to how the tracing ecosystem works, and I took a lot of inspiration from it, both in terms of API and implementation.
If you take a look at the tracing::dispatcher docs, you will find a couple of global functions:

- `set_global_default` to set a single global dispatcher.
- `set_default` to set a thread-local dispatcher and return a scope guard which unsets it.
- `get_default` to access the current dispatcher, preferring the thread-local one, and falling back to the global one.
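As a rough sketch of how these three functions fit together, here is a simplified stand-in built only on std. This is not tracing's actual implementation; the `Dispatch` type and all bodies are illustrative:

```rust
use std::cell::RefCell;
use std::sync::OnceLock;

// Stand-in for tracing's Dispatch type.
#[derive(Debug)]
struct Dispatch(&'static str);

// The single global default, set at most once.
static GLOBAL: OnceLock<Dispatch> = OnceLock::new();

thread_local! {
    // The per-thread override, unset by default.
    static LOCAL: RefCell<Option<Dispatch>> = const { RefCell::new(None) };
}

fn set_global_default(d: Dispatch) -> Result<(), Dispatch> {
    GLOBAL.set(d)
}

// Guard that unsets the thread-local dispatcher when dropped.
struct DefaultGuard;

impl Drop for DefaultGuard {
    fn drop(&mut self) {
        LOCAL.with_borrow_mut(|local| *local = None);
    }
}

fn set_default(d: Dispatch) -> DefaultGuard {
    LOCAL.with_borrow_mut(|local| *local = Some(d));
    DefaultGuard
}

fn get_default<T>(f: impl FnOnce(&Dispatch) -> T) -> T {
    LOCAL.with_borrow(|local| match local {
        // Prefer the thread-local dispatcher if one is set...
        Some(d) => f(d),
        // ...and fall back to the global one otherwise.
        None => f(GLOBAL.get().unwrap_or(&Dispatch("none"))),
    })
}

fn main() {
    set_global_default(Dispatch("global")).unwrap();
    assert_eq!(get_default(|d| d.0), "global");
    {
        let _guard = set_default(Dispatch("local"));
        assert_eq!(get_default(|d| d.0), "local");
    } // guard dropped, thread-local unset again
    assert_eq!(get_default(|d| d.0), "global");
}
```

Note how `get_default` has to touch the thread-local on every single call, which is exactly where the performance question comes in.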
This is quite a common pattern and is not limited to metrics or tracing at all.
It was widely discussed in the Rust community over two years ago as Contexts and capabilities.
I also chimed in with my own blog post back then.
All the hype, however, was short-lived, and I have not heard anything about it since.
But as we will see, this whole problem is still alive and well, as "just use `thread_local!`" does not really cut it.
Let's get back to code, and look at the implementation of get_default in tracing.
We can immediately see something surprising:
```rust
if SCOPED_COUNT.load(Ordering::Acquire) == 0 {
    // fast path if no scoped dispatcher has been set; just use the global
    // default.
    return f(get_global());
}
```
As the comment suggests, accessing thread locals is so slow that it's beneficial to avoid them at all cost, for example by using a global `AtomicUsize` to keep track of the number of thread-local overrides currently set.
The assumption here is that in most cases, you will only have a single global context, or maybe even none at all. Using thread-local overrides is rather rare, mostly used within tests where you want to have isolation.
This is an interesting hypothesis, so let's try to prove it with some benchmarking.
So, instead of a local fallback version that does:

- check the `thread_local` if it is usable,
- fall back to the `global` otherwise,

it might be faster to use a global override version which does this instead:

- if any `thread_local` is defined:
  - check our own `thread_local` if it is usable,
- fall back to the `global` otherwise.
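A minimal sketch of the two variants, using a `Cell<usize>` with `0` standing in for "unset" (the names, the guard type, and the sentinel encoding are all made up for illustration):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

static GLOBAL: usize = 42;
// Counts how many thread-local overrides exist across ALL threads.
static SCOPED_COUNT: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // 0 means "no override set on this thread".
    static OVERRIDE: Cell<usize> = const { Cell::new(0) };
}

// Variant 1: local fallback. Always touches the thread-local first.
fn get_local_fallback() -> usize {
    match OVERRIDE.get() {
        0 => GLOBAL,
        v => v,
    }
}

// Variant 2: global override. Checks the atomic first, and only
// touches the thread-local if an override exists anywhere at all.
fn get_global_override() -> usize {
    if SCOPED_COUNT.load(Ordering::Acquire) == 0 {
        // Fast path: no override on any thread, use the global.
        return GLOBAL;
    }
    match OVERRIDE.get() {
        0 => GLOBAL,
        v => v,
    }
}

// Scope guard that sets an override and maintains the counter.
struct Guard;

fn set_override(v: usize) -> Guard {
    OVERRIDE.set(v);
    SCOPED_COUNT.fetch_add(1, Ordering::Release);
    Guard
}

impl Drop for Guard {
    fn drop(&mut self) {
        OVERRIDE.set(0);
        SCOPED_COUNT.fetch_sub(1, Ordering::Release);
    }
}

fn main() {
    assert_eq!(get_global_override(), 42);
    let guard = set_override(7);
    assert_eq!(get_global_override(), 7);
    assert_eq!(get_local_fallback(), 7);
    drop(guard);
    assert_eq!(get_global_override(), 42);
}
```

The trade-off is visible right in the code: variant 2 trades one atomic load on every read for the chance to skip the thread-local access entirely.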
My intuition here is that the fast path should prove worthwhile, but it might well turn out to be worse in case we actually have a thread-local override set.
So I expect things to be ranked performance-wise like this:
- using global with fast path (no thread local set)
- using a thread local value (thread local is set)
- using global without fast path (no thread local set)
- using a thread local despite the fast path (thread local is set)
# Benchmarks
Benchmarking this however turns out to be a bit difficult, in particular separating the setup/teardown code from the actual benchmark. In this specific case, that means separating setting / unsetting the thread-local value from actually using it.
I use divan for benchmarks, primarily because it has a convenient API, and because it makes it trivial to run the benchmark on multiple threads just by using the DIVAN_THREADS env variable. I actually use a fork which avoids spawning thousands of threads, and instead reuses just as many threads as were requested.
While in theory divan allows separating the setup and teardown code by using with_inputs for setup, and returning a type to be Drop-ed later, the way this works is fundamentally incompatible with scope guards: it first initializes all the inputs, then runs the benchmark, and only Drops all the outputs afterwards.
So all of the results for the benchmarks with the thread local set also include the code for setting and unsetting the thread_local. I tried to account for that by running multiple reads per benchmark iteration, so there is a higher read-to-write ratio.
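The amortization idea, shown in isolation (the override value and the read count are arbitrary; this is not the actual benchmark code):

```rust
use std::cell::Cell;
use std::hint::black_box;

thread_local! {
    static OVERRIDE: Cell<usize> = const { Cell::new(0) };
}

const READS_PER_ITER: usize = 100;

// One benchmark iteration: a single set/unset pair amortized over many
// reads, so the guard overhead contributes less to the measured time.
fn bench_iteration() -> usize {
    OVERRIDE.set(7); // "setup" that divan cannot separate out for us
    let mut sum = 0;
    for _ in 0..READS_PER_ITER {
        // black_box keeps the compiler from folding the loop away.
        sum += black_box(OVERRIDE.get());
    }
    OVERRIDE.set(0); // "teardown"
    sum
}

fn main() {
    assert_eq!(bench_iteration(), 700);
}
```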
I won’t be copy-pasting the code in this blog post, but curious folks can find all the code here.
Here are the first results:
```
locals                          fastest  │ slowest  │ median   │ mean
├─ _1_stable_copy                        │          │          │
│  ├─ _1_global_override_unset  984.2 ns │ 86.33 µs │ 1.014 µs │ 1.034 µs
│  ├─ _2_local_fallback_set     1.074 µs │ 19.36 µs │ 1.095 µs │ 1.124 µs
│  ├─ _3_local_fallback_unset   1.155 µs │ 20.12 µs │ 1.176 µs │ 1.209 µs
│  ╰─ _4_global_override_set    1.196 µs │ 20.78 µs │ 1.236 µs │ 1.263 µs
```
The benchmark results do prove my initial intuition, though only by a small-ish margin. I am still far from being able to achieve stable benchmark timings, and I see quite some variance between runs.
Here, the global override solution that checks a global flag whether any thread_local override exists is the fastest, and checking the thread_local first (local fallback) comes in second. However, if one actually intends to use the thread_local overrides, using only a thread_local! for that is a bit faster than additionally maintaining a global atomic counter, no surprise there.
Now, as I mentioned, one of the selling points of divan is the ability to trivially run the benchmarks on multiple threads, so let's do just that via DIVAN_THREADS=0,1:
```
locals                          fastest  │ slowest  │ median   │ mean
├─ _1_stable_copy                        │          │          │
│  ├─ _1_global_override_unset           │          │          │
│  │  ├─ t=1                    978.7 ns │ 34.87 µs │ 994.2 ns │ 1.016 µs
│  │  ╰─ t=16                   994.2 ns │ 7 µs     │ 1.054 µs │ 1.081 µs
│  ├─ _2_local_fallback_set              │          │          │
│  │  ├─ t=1                    1.06 µs  │ 25.58 µs │ 1.11 µs  │ 1.152 µs
│  │  ╰─ t=16                   1.145 µs │ 11.78 µs │ 1.245 µs │ 1.243 µs
│  ├─ _3_local_fallback_unset            │          │          │
│  │  ├─ t=1                    1.18 µs  │ 25.1 µs  │ 1.2 µs   │ 1.237 µs
│  │  ╰─ t=16                   1.206 µs │ 6.94 µs  │ 1.276 µs │ 1.292 µs
│  ╰─ _4_global_override_set             │          │          │
│     ├─ t=1                    1.201 µs │ 24.78 µs │ 1.256 µs │ 1.293 µs
│     ╰─ t=16                   1.316 µs │ 12.48 µs │ 1.417 µs │ 1.447 µs
```
Surprisingly, running things on multiple threads does not really make such a big difference, even with 16 threads while my CPU only has 8 physical cores. I would have expected things to slow down quite significantly, especially in the worst-case scenario when multiple threads are contending on the single global “does any thread_local exist” counter.
So things aren’t even as bad as I made them out to be? So where is the problem?
Well, anyone who clicked the link to the code above might have seen that this first round of benchmarks was using a trivial Cell&lt;usize&gt;. Things get quite a bit more complicated, both in terms of code to write as well as in performance characteristics, when dealing with more complex types which are not Copy and have a custom Drop implementation.
So let's do the same thing, but this time with a RefCell&lt;Option&lt;String&gt;&gt;. We cannot just get() the value anymore; we have to use with_borrow and nest all the code in a closure because of lifetimes.
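For example, a small sketch of that access pattern (the context type and the fallback value are made up):

```rust
use std::cell::RefCell;

thread_local! {
    static CONTEXT: RefCell<Option<String>> = const { RefCell::new(None) };
}

// All code that wants to look at the value has to be nested inside the
// closure passed to `with_borrow`, since the borrow ends when it returns.
fn with_context<T>(f: impl FnOnce(&str) -> T) -> T {
    CONTEXT.with_borrow(|ctx| match ctx.as_deref() {
        Some(local) => f(local),
        None => f("global"), // fall back when no override is set
    })
}

fn main() {
    assert_eq!(with_context(str::len), 6); // falls back to "global"
    CONTEXT.with_borrow_mut(|ctx| *ctx = Some("local".to_string()));
    assert_eq!(with_context(str::to_owned), "local");
}
```

On top of the closure gymnastics, every access now also pays for the RefCell borrow-flag bookkeeping.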
How are things looking with a RefCell&lt;Option&lt;String&gt;&gt;?
```
locals                          fastest  │ slowest  │ median   │ mean
├─ _2_stable_drop                        │          │          │
│  ├─ _1_global_override_unset           │          │          │
│  │  ├─ t=1                    1.452 µs │ 18.88 µs │ 1.493 µs │ 1.522 µs
│  │  ╰─ t=16                   1.568 µs │ 13.02 µs │ 1.688 µs │ 1.708 µs
│  ├─ _2_local_fallback_set              │          │          │
│  │  ├─ t=1                    3.893 µs │ 50.76 µs │ 3.994 µs │ 4.087 µs
│  │  ╰─ t=16                   4.342 µs │ 17.37 µs │ 4.635 µs │ 4.724 µs
│  ├─ _3_local_fallback_unset            │          │          │
│  │  ├─ t=1                    3.808 µs │ 49.25 µs │ 3.929 µs │ 4.014 µs
│  │  ╰─ t=16                   3.99 µs  │ 19.42 µs │ 4.231 µs │ 4.331 µs
│  ╰─ _4_global_override_set             │          │          │
│     ├─ t=1                    4.02 µs  │ 51.16 µs │ 4.102 µs │ 4.238 µs
│     ╰─ t=16                   4.425 µs │ 32.14 µs │ 4.677 µs │ 4.907 µs
```
Now there is a much bigger win from our optimization, more than 2x actually.
# Nightly
So far, all the benchmarks were using the thread_local! macro which is available on stable Rust. There is also the nightly-only #[thread_local] attribute that can be used with a static mut, which also means that one has to use unsafe for any access to the thread-local. As a side note, the tracking issue for the #[thread_local] attribute was opened in 2015, so I'm doubtful it will make progress, and we might not even need it after all.
I tried that as well, using a straight-up Option&lt;String&gt; to avoid any RefCell overhead. In this case we are only creating shared references, so it is not quite as easy to cause major breakage as it would be with a mutable reference. But as we have to use unsafe, all the other restrictions still apply.
Using the nightly #[thread_local] attribute gives these results for trivial and complex types:
```
locals                          fastest  │ slowest  │ median   │ mean
├─ _3_nightly_copy                       │          │          │
│  ├─ _1_global_override_unset           │          │          │
│  │  ├─ t=1                    1.133 µs │ 33.37 µs │ 1.164 µs │ 1.185 µs
│  │  ╰─ t=16                   1.179 µs │ 9.845 µs │ 1.219 µs │ 1.25 µs
│  ├─ _2_local_fallback_set              │          │          │
│  │  ├─ t=1                    1.29 µs  │ 22 µs    │ 1.316 µs │ 1.351 µs
│  │  ╰─ t=16                   1.331 µs │ 7.751 µs │ 1.386 µs │ 1.425 µs
│  ├─ _3_local_fallback_unset            │          │          │
│  │  ├─ t=1                    1.391 µs │ 19.19 µs │ 1.441 µs │ 1.463 µs
│  │  ╰─ t=16                   1.481 µs │ 13.83 µs │ 1.552 µs │ 1.605 µs
│  ╰─ _4_global_override_set             │          │          │
│     ├─ t=1                    1.441 µs │ 23.64 µs │ 1.467 µs │ 1.498 µs
│     ╰─ t=16                   1.523 µs │ 42.82 µs │ 1.613 µs │ 1.675 µs
╰─ _4_nightly_drop                       │          │          │
   ├─ _1_global_override_unset           │          │          │
   │  ├─ t=1                    1.467 µs │ 74.43 µs │ 1.498 µs │ 1.538 µs
   │  ╰─ t=16                   1.573 µs │ 27.9 µs  │ 1.654 µs │ 1.68 µs
   ├─ _2_local_fallback_set              │          │          │
   │  ├─ t=1                    1.886 µs │ 56.67 µs │ 1.957 µs │ 1.991 µs
   │  ╰─ t=16                   2.25 µs  │ 58.89 µs │ 2.542 µs │ 2.598 µs
   ├─ _3_local_fallback_unset            │          │          │
   │  ├─ t=1                    1.806 µs │ 27.15 µs │ 1.871 µs │ 1.901 µs
   │  ╰─ t=16                   1.967 µs │ 40.45 µs │ 2.047 µs │ 2.129 µs
   ╰─ _4_global_override_set             │          │          │
      ├─ t=1                    2.047 µs │ 53.8 µs  │ 2.069 µs │ 2.117 µs
      ╰─ t=16                   2.321 µs │ 36.43 µs │ 2.594 µs │ 2.706 µs
```
Interestingly, this turns out to be worse for trivial types, but ends up being a win for more complex types. I am not sure whether the win comes from avoiding the RefCell though; that remains an exercise for the reader.