Fun with benchmarking small string optimization
Sometimes I really like to go down a rabbit hole and obsess about some tiny details.
Today is such a day, and I am benchmarking, profiling and trying to micro-optimize a piece of Rust code, specifically looking at small string optimization.
My use-case is as follows: I have a slice of `&dyn Display` values, and I want to stringify and capture those values for later.
To have some comparable test-cases, here is also a small benchmarking function that calls our actual implementation with a bunch of examples:
```rust
type InputTags<'a> = &'a [&'a dyn std::fmt::Display];

fn run<F, R>(f: F)
where
    F: Fn(InputTags) -> R,
{
    use std::hint::black_box;

    let tags: InputTags = &[];
    black_box(f(black_box(tags)));

    let tags: InputTags = &[&true];
    black_box(f(black_box(tags)));

    let tags: InputTags = &[&123u32, &456u64];
    black_box(f(black_box(tags)));

    let tags: InputTags = &[&123f32, &b'b', &'c'];
    black_box(f(black_box(tags)));

    let tags: InputTags = &[
        &"some more, ",
        &"and a bit longer",
        &"tag values.",
        &"with one a bit longer than the capacity of a smol_str",
    ];
    black_box(f(black_box(tags)));
}
```
As you can see, this is also using `std::hint::black_box` to prevent the compiler from just optimizing everything away.
A straightforward and extremely simple implementation is this one-liner:
```rust
pub fn vec_string() {
    run(|tags| -> Vec<String> { tags.iter().map(|d| d.to_string()).collect() })
}
```
My assumption though is that this is suboptimal for a number of reasons, which I want to improve upon:
- Both the `Vec` and the `String` are growable types.
- Both are thus wasting a `capacity: usize` that is not needed.
- In particular, the `String` might have some spare capacity at the end that is not needed.
- The `String` has an indirection to the heap, and an allocation.
Another thing that is not quite obvious from this example is that I want to also optimize the on-stack size of the result. So I am mostly focused on the performance and allocation of the inner string, and not the outer container.
We can get rid of the excess capacity (and the `usize`) by switching to a boxed slice / `str`.
Here, my assumptions are:

- Switching to `Box<[String]>` should be quite cheap, as we know the length / allocation ahead of time.
- Switching to `Box<str>` however will likely result in a re-allocation.
- However, we are saving 8 bytes per string, so that should be a positive on the total memory usage (see the quick size check below).
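To put a number on that saving, here is a quick size check. This is not part of the benchmark code, just a small sketch assuming a typical 64-bit target:

```rust
use std::mem::size_of;

fn main() {
    // String carries pointer + length + capacity.
    assert_eq!(size_of::<String>(), 24);
    // Box<str> is a fat pointer: just pointer + length.
    assert_eq!(size_of::<Box<str>>(), 16);
    // The same applies to the outer container.
    assert_eq!(size_of::<Vec<String>>(), 24);
    assert_eq!(size_of::<Box<[String]>>(), 16);
}
```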
To validate the first two points, I separated both moves into their own functions:
```rust
pub fn boxed_string() {
    run(|tags| -> Option<Box<[String]>> {
        if tags.is_empty() {
            return None;
        }
        let collected_tags: Vec<_> = tags.iter().map(|d| d.to_string()).collect();
        Some(collected_tags.into_boxed_slice())
    })
}

pub fn boxed_boxed() {
    run(|tags| -> Option<Box<[Box<str>]>> {
        if tags.is_empty() {
            return None;
        }
        let collected_tags: Vec<_> = tags
            .iter()
            .map(|d| d.to_string().into_boxed_str())
            .collect();
        Some(collected_tags.into_boxed_slice())
    })
}
```
As you can see, I actually made these an `Option<Box>`, as even an empty `Box` would incur an allocation, and I would like to avoid those. Due to the niche optimization, the `Option<Box>` is still only 16 bytes on-stack.
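A quick way to double-check that claim, again assuming a 64-bit target:

```rust
use std::mem::size_of;

fn main() {
    // The None case lives in the null-pointer niche of the Box,
    // so wrapping it in an Option costs no extra space.
    assert_eq!(size_of::<Box<[Box<str>]>>(), 16);
    assert_eq!(size_of::<Option<Box<[Box<str>]>>>(), 16);
}
```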
The `into_boxed_str` call will most likely result in a re-allocation and copy, plus a subsequent free of the growable `String`.
I have previously seen in profiles that these allocations can be quite hot, and I have also had some success switching to a persistent `thread_local` allocation.
My assumption there is that relying on a thread-local is beneficial compared to using the system allocator, in particular in multi-threaded code.
So here is the next step, using a persistent `thread_local` `String`:
```rust
pub fn thread_local() {
    use std::cell::Cell;
    use std::fmt::Write;

    thread_local! {
        static BUFFER: Cell<String> = const { Cell::new(String::new()) };
    }

    run(|tags| -> Option<Box<[Box<str>]>> {
        if tags.is_empty() {
            return None;
        }
        let mut string_buf = BUFFER.take();
        let collected_tags: Vec<_> = tags
            .iter()
            .map(|d| {
                string_buf.clear();
                write!(&mut string_buf, "{d}").unwrap();
                string_buf.as_str().into()
            })
            .collect();
        BUFFER.set(string_buf);
        Some(collected_tags.into_boxed_slice())
    })
}
```
I opted to use a `Cell`, combined with `take()` here, although a `RefCell` with `with_borrow_mut` might be equally good, or even better. In this case, I wanted to avoid having to use yet another closure.
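For comparison, a `RefCell`-based variant could look roughly like this. This is just a sketch; the `stringify` helper is hypothetical and not part of the benchmarked code:

```rust
use std::cell::RefCell;
use std::fmt::Write;

thread_local! {
    static BUFFER: RefCell<String> = const { RefCell::new(String::new()) };
}

// Hypothetical helper showing the `with_borrow_mut` shape,
// which needs the extra closure I wanted to avoid.
fn stringify(d: &dyn std::fmt::Display) -> Box<str> {
    BUFFER.with_borrow_mut(|buf| {
        buf.clear();
        write!(buf, "{d}").unwrap();
        buf.as_str().into()
    })
}
```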
One valid piece of feedback here might be that we are still going to the heap, through a `thread_local` indirection. Why not just stay on the stack altogether?
Very good point, though one motivation for this `thread_local` solution is that it uses only `std` types.
If we want to pull in code from crates.io, we can for example use the `smallvec` crate to keep a stack-local buffer, and then copy bytes over to the final `Box<str>` from there.
The assumption is that this should avoid the `thread_local` overhead, which unfortunately can be quite significant, as I have written about previously.
Here is a small wrapper around `smallvec` that acts as a `String`:
```rust
use std::fmt::Write;

#[derive(Default)]
struct StringBuf<const N: usize> {
    buf: smallvec::SmallVec<u8, N>,
}

impl<const N: usize> StringBuf<N> {
    pub fn as_str(&self) -> &str {
        // SAFETY: the buffer is only ever filled through `write_str`,
        // which appends complete, valid UTF-8 strings.
        unsafe { std::str::from_utf8_unchecked(&self.buf) }
    }

    pub fn clear(&mut self) {
        self.buf.clear()
    }
}

impl<const N: usize> Write for StringBuf<N> {
    fn write_str(&mut self, s: &str) -> std::fmt::Result {
        self.buf.extend_from_slice(s.as_bytes());
        Ok(())
    }
}
```
And here is the final code using that wrapper:
```rust
pub fn smallvec() {
    run(|tags| -> Option<Box<[Box<str>]>> {
        if tags.is_empty() {
            return None;
        }
        let mut string_buf = StringBuf::<128>::default();
        let collected_tags: Vec<_> = tags
            .iter()
            .map(|d| {
                string_buf.clear();
                write!(&mut string_buf, "{d}").unwrap();
                string_buf.as_str().into()
            })
            .collect();
        Some(collected_tags.into_boxed_slice())
    })
}
```
Next up, we are still doing an allocation and a copy for the `Box<str>`. Let’s get to the meat of this blog post by finally introducing small string optimization.
I will use the `smol_str` crate for this, which also comes with an incredibly convenient `format_smolstr!` macro:
```rust
use smol_str::{format_smolstr, SmolStr};

pub fn smolstr() {
    run(|tags| -> Option<Box<[SmolStr]>> {
        if tags.is_empty() {
            return None;
        }
        let collected_tags: Vec<_> = tags.iter().map(|d| format_smolstr!("{d}")).collect();
        Some(collected_tags.into_boxed_slice())
    })
}
```
The assumption is that we should see quite an improvement by avoiding all the allocations. However, the test function has one string that is too large and will spill to the heap. My assumption is that `smol_str` might not handle that edge-case very well, as I think it is spilling to a `String`, and then doing yet another re-allocation with the `Arc` equivalent of `into_boxed_str`.
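To make the inline-vs-heap boundary explicit, here is a small sketch of the `smol_str` properties I am relying on. The 23-byte inline capacity is my understanding of the current implementation, so treat it as an assumption:

```rust
use smol_str::SmolStr;

fn main() {
    // Same on-stack size as String...
    assert_eq!(
        std::mem::size_of::<SmolStr>(),
        std::mem::size_of::<String>()
    );
    // ...but short strings (up to 23 bytes) are stored inline, without allocating.
    assert!(!SmolStr::new("a short tag value").is_heap_allocated());
    // The long test string from `run` spills to the heap.
    let long = SmolStr::new("with one a bit longer than the capacity of a smol_str");
    assert!(long.is_heap_allocated());
}
```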
Just to validate that case, I built another variant, combining the two:
```rust
pub fn smallvec_smolstr() {
    run(|tags| -> Option<Box<[SmolStr]>> {
        if tags.is_empty() {
            return None;
        }
        let mut string_buf = StringBuf::<128>::default();
        let collected_tags: Vec<_> = tags
            .iter()
            .map(|d| {
                string_buf.clear();
                write!(&mut string_buf, "{d}").unwrap();
                string_buf.as_str().into()
            })
            .collect();
        Some(collected_tags.into_boxed_slice())
    })
}
```
# Benchmarking
So I have my test case, and I have written a bunch of implementations with some assumptions about how they should behave.
Before we jump into the results, I want to highlight that I did not follow benchmarking best practices, which would be:
- Disable automatic frequency scaling, so things are less affected by thermal management.
- Disable hyper-threading.
- Pin threads to specific cores to avoid scheduling / context switching overhead.
- Keep the system as idle as possible.
Well, ain’t nobody got time fo dat. I have my IDE and browser with music running here in the background. I also can’t be bothered to figure out how to even disable all the CPU features that could lead to unstable benchmark results. And as we will see, the results can get very unstable indeed.
For benchmarking, I am using the `divan` crate here. While `criterion` is often said to provide more stable results, I do quite like the convenience of `divan`, especially being able to trivially run the code multi-threaded. To try to get some more stable results, I also experimented with `iai-callgrind`.
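For context, the `divan` harness is wired up roughly like this. This is a sketch: the benchmark names are taken from the output below, and the rest of the setup is an assumption:

```rust
// benches/divan.rs (sketch), assuming the functions above are in scope.
fn main() {
    // Runs all registered `#[divan::bench]` functions.
    divan::main();
}

#[divan::bench]
fn _1_vec_string() {
    vec_string();
}

// ...one benchmark function per implementation, down to...

#[divan::bench]
fn _7_smallvec_smolstr() {
    smallvec_smolstr();
}
```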
But let’s take a look at the `divan` results first. Here is a benchmark run on Windows:
```
divan                   fastest  │ slowest  │ median   │ mean
├─ _1_vec_string        1.012 µs │ 1.524 µs │ 1.037 µs │ 1.048 µs
├─ _2_boxed_string      993.5 ns │ 2.343 µs │ 1.024 µs │ 1.055 µs
├─ _3_boxed_boxed       1.499 µs │ 5.112 µs │ 1.574 µs │ 1.637 µs
├─ _4_thread_local      1.099 µs │ 1.862 µs │ 1.124 µs │ 1.144 µs
├─ _5_smallvec          1.099 µs │ 1.637 µs │ 1.131 µs │ 1.137 µs
├─ _6_smolstr           781 ns   │ 1.412 µs │ 793.5 ns │ 803.2 ns
╰─ _7_smallvec_smolstr  756 ns   │ 999.7 ns │ 762.2 ns │ 772 ns
```
If we look at both the fastest and median time, we can see the following differences:
- Going from `Vec<String>` to `Box<[String]>` is a wash.
- The re-allocation going from `String` to `Box<str>` indeed hurts quite a bit.
- Using either the `thread_local` or the stack-allocated `smallvec` improves upon that.
- However, it seems to be slightly slower than the naive `Vec<String>` implementation.
- Using `smol_str` is quite a win compared to the baseline.
- Surprisingly, using both `smallvec` and `smol_str` is another tiny win.
To give an example of the very unstable results I seem to get, here are the results running the exact same code another time:
```
divan                   fastest  │ slowest  │ median   │ mean
├─ _1_vec_string        1.018 µs │ 2.581 µs │ 2.106 µs │ 2.048 µs
├─ _2_boxed_string      1.037 µs │ 2.449 µs │ 2.112 µs │ 2.04 µs
├─ _3_boxed_boxed       1.537 µs │ 6.412 µs │ 2.537 µs │ 2.562 µs
├─ _4_thread_local      1.118 µs │ 2.481 µs │ 2.143 µs │ 2.094 µs
├─ _5_smallvec          1.106 µs │ 17.63 µs │ 2.134 µs │ 2.8 µs
├─ _6_smolstr           899.7 ns │ 10.99 µs │ 2.699 µs │ 2.955 µs
╰─ _7_smallvec_smolstr  762.2 ns │ 4.043 µs │ 2.743 µs │ 2.717 µs
```
For some reason, the median time is around twice what it was before across the board, with the median for the stack-allocated code appearing quite a bit slower now. The fastest times, however, are very much on par with the previous results.
As I mentioned, one nice thing about `divan` is that you can run the benchmarks in multiple threads, simply by adding `DIVAN_THREADS=16` to your environment. Now we can see something like:
```
divan                   fastest  │ slowest  │ median   │ mean
├─ _1_vec_string        1.699 µs │ 204.8 µs │ 4.199 µs │ 5.569 µs
├─ _2_boxed_string      1.799 µs │ 70.49 µs │ 4.199 µs │ 4.913 µs
├─ _3_boxed_boxed       2.299 µs │ 89.49 µs │ 5.099 µs │ 5.796 µs
├─ _4_thread_local      2.499 µs │ 110.7 µs │ 5.899 µs │ 7.752 µs
├─ _5_smallvec          1.599 µs │ 60.59 µs │ 3.199 µs │ 3.638 µs
├─ _6_smolstr           1.399 µs │ 82.89 µs │ 3.749 µs │ 4.255 µs
╰─ _7_smallvec_smolstr  1.099 µs │ 20.97 µs │ 1.949 µs │ 2.043 µs
```
Here we can see:

- `smol_str` combined with `smallvec` is the clear winner here.
- Surprisingly, `thread_local` is quite a pessimization, even running things multi-threaded.
As mentioned previously, I am running 16 threads here on a loaded Windows system with a Ryzen 2700X CPU, which has an 8C/16T configuration.
The `thread_local` results, however, were really surprising. I ran quite a bit of benchmarking and profiling on macOS, which showed clear wins. Possibly because the macOS system allocator is particularly slow?
I don’t really want to turn on the MacBook, as I am on sabbatical right now. But I can try running the same benchmarks on Linux, or rather the Windows Subsystem for Linux (WSL), to be exact.
Here are the results from Linux running in WSL on the exact same Windows system I used to capture the results above:
```
divan                   fastest  │ slowest  │ median   │ mean
├─ _1_vec_string        447.7 ns │ 3.622 µs │ 457.7 ns │ 494.1 ns
├─ _2_boxed_string      452.5 ns │ 488.5 ns │ 455 ns   │ 455.3 ns
├─ _3_boxed_boxed       564.5 ns │ 612.2 ns │ 567 ns   │ 566.8 ns
├─ _4_thread_local      561.7 ns │ 3.613 µs │ 571.7 ns │ 604.6 ns
├─ _5_smallvec          557.5 ns │ 3.377 µs │ 562.2 ns │ 590.7 ns
├─ _6_smolstr           504.7 ns │ 2.164 µs │ 513.7 ns │ 532.5 ns
╰─ _7_smallvec_smolstr  526.5 ns │ 586.2 ns │ 528.7 ns │ 528.9 ns
```
There are some quite surprising results here:

- Linux seems to be twice as fast as Windows, even running within WSL.
- Again, the `thread_local` variant is slower than just directly allocating.
- Most surprisingly, using `smol_str` and `smallvec` is slower as well.
Interestingly, the results are quite a bit more stable than on Windows, possibly because of a better CPU scheduler. I didn’t see wildly different results when I re-ran the same benchmark.
Running the benchmarks multi-threaded via `DIVAN_THREADS=16` on Linux gives these results:
```
divan                   fastest  │ slowest  │ median   │ mean
├─ _1_vec_string        1.533 µs │ 118.4 µs │ 2.171 µs │ 2.344 µs
├─ _2_boxed_string      1.69 µs  │ 37.37 µs │ 2.217 µs │ 2.311 µs
├─ _3_boxed_boxed       997.7 ns │ 36.98 µs │ 2.383 µs │ 2.476 µs
├─ _4_thread_local      1.829 µs │ 146.5 µs │ 3.234 µs │ 3.394 µs
├─ _5_smallvec          1.431 µs │ 97.89 µs │ 2.346 µs │ 2.648 µs
├─ _6_smolstr           1.718 µs │ 31.45 µs │ 2.707 µs │ 2.932 µs
╰─ _7_smallvec_smolstr  1.287 µs │ 46.39 µs │ 2.4 µs   │ 2.737 µs
```
These results are similar to the ones above:

- The simplest implementation seems to be the fastest.
- `thread_local` does not seem to be an improvement over just allocating directly.
- `smol_str` and `smallvec` are not providing any gains here.
# Callgrind and Profiling
I mentioned that I was also experimenting with `iai-callgrind`, so let’s look at those results as well.
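For context, the callgrind benchmarks are defined with the `iai-callgrind` attribute macros, roughly like this. Again a sketch: the group name is inferred from the output below, and the rest of the setup is an assumption:

```rust
use iai_callgrind::{library_benchmark, library_benchmark_group, main};

#[library_benchmark]
fn _1_vec_string() {
    vec_string()
}

// ...one `#[library_benchmark]` per implementation...

#[library_benchmark]
fn _7_smallvec_smolstr() {
    smallvec_smolstr()
}

library_benchmark_group!(
    name = bench_group;
    benchmarks = _1_vec_string, _7_smallvec_smolstr
);

main!(library_benchmark_groups = bench_group);
```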
```
callgrind::bench_group::_1_vec_string
  Instructions:       5272|5272   (No change)
  L1 Hits:            7421|7421   (No change)
  L2 Hits:            8|8         (No change)
  RAM Hits:           122|122     (No change)
  Total read+write:   7551|7551   (No change)
  Estimated Cycles:   11731|11731 (No change)

callgrind::bench_group::_2_boxed_string
  Instructions:       5326|5326   (No change)
  L1 Hits:            7492|7492   (No change)
  L2 Hits:            8|8         (No change)
  RAM Hits:           126|126     (No change)
  Total read+write:   7626|7626   (No change)
  Estimated Cycles:   11942|11942 (No change)

callgrind::bench_group::_3_boxed_boxed
  Instructions:       6380|6380   (No change)
  L1 Hits:            8887|8887   (No change)
  L2 Hits:            12|12       (No change)
  RAM Hits:           131|131     (No change)
  Total read+write:   9030|9030   (No change)
  Estimated Cycles:   13532|13532 (No change)

callgrind::bench_group::_4_thread_local
  Instructions:       7065|7065   (No change)
  L1 Hits:            9998|9998   (No change)
  L2 Hits:            20|20       (No change)
  RAM Hits:           139|139     (No change)
  Total read+write:   10157|10157 (No change)
  Estimated Cycles:   14963|14963 (No change)

callgrind::bench_group::_5_smallvec
  Instructions:       5851|5851   (No change)
  L1 Hits:            8236|8236   (No change)
  L2 Hits:            12|12       (No change)
  RAM Hits:           133|133     (No change)
  Total read+write:   8381|8381   (No change)
  Estimated Cycles:   12951|12951 (No change)

callgrind::bench_group::_6_smolstr
  Instructions:       5035|5035   (No change)
  L1 Hits:            7226|7226   (No change)
  L2 Hits:            11|11       (No change)
  RAM Hits:           138|138     (No change)
  Total read+write:   7375|7375   (No change)
  Estimated Cycles:   12111|12111 (No change)

callgrind::bench_group::_7_smallvec_smolstr
  Instructions:       5201|5201   (No change)
  L1 Hits:            7373|7373   (No change)
  L2 Hits:            12|12       (No change)
  RAM Hits:           133|133     (No change)
  Total read+write:   7518|7518   (No change)
  Estimated Cycles:   12088|12088 (No change)
```
This is quite a lot to take in. The first thing you see is `(No change)`. Indeed, the results are stable: re-running the benchmark yields the exact same numbers.
However, what are we even looking at? Wall time numbers are very obvious, and it’s also very obvious that lower is better. But which numbers are we trying to improve here exactly? Are we aiming for Instructions, or rather Estimated Cycles, which itself is derived from the L1, L2 and RAM numbers?
It is hard to say what exactly we are optimizing for. But we can see that the `smol_str` variant is winning when looking at the lowest Instructions count. The simplest implementation also has the lowest Estimated Cycles, with the `smol_str` / `smallvec` solution being the runner-up.
Last but not least, let’s do some profiling as well. I’m a huge fan of `samply`. When doing profiling, you should keep in mind that you are pulling a Heisenberg here, as you are changing the outcome by measuring it. The overhead of profiling is negligibly small though, and it did not make any difference in my particular case here.
Here is a flamegraph for the simplest implementation using `Vec<String>`:
As expected, we see a lot of `alloc` nested within the `fmt` call. A bit surprising here is also the high cost of `drop`-ing all the allocated data.
Compare that with the flamegraph of the `smol_str` and `smallvec` solution:
In this flamegraph we can now clearly see the `fmt`-related code, separate from the `smol_str` code, again separate from the allocation of the outer `Box`.
However, there are still some surprises hidden in this flamegraph:

- `SmolStr::new` is surprisingly expensive, taking as much time as `fmt`-ing into the stack-allocated `smallvec`.
- I see a big `core::ptr::write` there, which I assume is copying the `SmolStr` into the `Vec`.
- The `Iterator::collect` abstraction leads to a very deep and complex call graph, though I assume all of that to be inlined.
# Conclusion
As I mentioned in the intro, I really enjoy obsessing over these tiny details. It was fun trying to micro-optimize this particular use-case, although one of the main outcomes of it might well be:
There is no need to micro-optimize things this hard, as the simplest solution is already good enough in most cases. Some optimizations that I assumed would be beneficial even turned out not to be.
Another thing to note here though is that this is a micro benchmark. It was already hard to get halfway stable results. Not to mention that micro benchmarks can also be sensitive to very minute details like code alignment.
I definitely like macro benchmarks and profiling, aka looking at flamegraphs, a lot more. It is much easier to see both opportunities as well as improvements that way.
Speaking of opportunities, `SmolStr::new` looks much slower than it needs to be, so I might try to optimize it further.
The `core::ptr::write` I mentioned is also interesting, as I would have assumed that the compiler is smart enough to write the `SmolStr` directly into the allocation. I believe the nightly-only `new_uninit` feature might help there. At the very least, that feature would allow returning an `Arc<[T]>` instead of a `Box<[T]>` without having to re-allocate.
Converting from a `Vec<T>` to a `Box<[T]>` is pretty much free as long as the `Vec` is allocated with the final size known ahead of time. That is not possible for `Arc`, as the reference counts have to live somewhere as well.
`Arc::new_uninit_slice` would solve that, as I believe it’s pretty much the same as `Vec::with_capacity`, except it requires a bit of `unsafe` code to use.