More fun with SSO, part 2
Recently I was taking a look at benchmarking various approaches to small string optimization (SSO), and you can read up on that here.
As a small recap, I want to turn a &[dyn Display] into a Vec<String>, or similar type, as efficiently as possible.
Even before writing that post, I had forked the smol_str crate and rewritten its internals around a hand-crafted binary layout, using far more unsafe code than some people may find acceptable. But as we will see, it manages to squeeze out a bit more performance.
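To make the idea of small string optimization concrete, here is a minimal sketch in safe Rust. It is not meant to mirror the actual layout of smol_str or of the fork: `SmallStr` and `INLINE_CAP` are hypothetical names picked for illustration, and a hand-crafted layout would pack the bytes more tightly and skip the UTF-8 re-validation.

```rust
/// Hypothetical inline capacity, chosen for illustration only.
const INLINE_CAP: usize = 22;

/// A string that stores short contents inline and falls back to the heap.
#[derive(Debug, Clone)]
pub enum SmallStr {
    /// Short strings live directly inside the value, no allocation needed.
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    /// Longer strings are stored in a heap allocation.
    Heap(Box<str>),
}

impl SmallStr {
    pub fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStr::Inline { len: s.len() as u8, buf }
        } else {
            SmallStr::Heap(s.into())
        }
    }

    pub fn as_str(&self) -> &str {
        match self {
            // The bytes were copied from a valid `&str`, so this check cannot fail.
            // An unsafe implementation would skip the validation entirely.
            SmallStr::Inline { len, buf } => std::str::from_utf8(&buf[..*len as usize])
                .expect("inline bytes are valid UTF-8"),
            SmallStr::Heap(s) => &**s,
        }
    }

    pub fn is_inline(&self) -> bool {
        matches!(self, SmallStr::Inline { .. })
    }
}
```

Short tags such as `SmallStr::new("env:prod")` stay inline, while anything longer than `INLINE_CAP` bytes ends up in the `Heap` variant.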
Also, since the thread_local! variant performed well below my expectations, I changed it from using a Cell with take and set to using a RefCell with with_borrow_mut. That can avoid a round trip through the thread_local, as well as copying some bytes around.
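The difference between the two thread-local styles can be sketched roughly as follows. This is not the benchmark code itself: the function names, the `Box<str>` return type and the two statics are assumptions made for the example.

```rust
use std::cell::{Cell, RefCell};
use std::fmt::{Display, Write};

thread_local! {
    // Hypothetical reusable buffers, one per style.
    static CELL_BUF: Cell<String> = Cell::new(String::new());
    static REFCELL_BUF: RefCell<String> = RefCell::new(String::new());
}

/// `Cell` style: take the buffer out, use it, and put it back.
/// This accesses the thread-local twice and moves the `String` each time.
pub fn format_with_cell(value: &dyn Display) -> Box<str> {
    let mut buf = CELL_BUF.take();
    buf.clear();
    write!(buf, "{value}").unwrap();
    let out: Box<str> = buf.as_str().into();
    CELL_BUF.set(buf);
    out
}

/// `RefCell` style: borrow the buffer in place with a single access.
pub fn format_with_refcell(value: &dyn Display) -> Box<str> {
    REFCELL_BUF.with_borrow_mut(|buf| {
        buf.clear();
        write!(buf, "{value}").unwrap();
        buf.as_str().into()
    })
}
```

Both functions reuse the buffer's capacity across calls on the same thread. The `RefCell` version pays for a runtime borrow check instead of moving the `String` in and out.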
Here is the benchmark I added on top of the previous ones. The code now also uses a manual for loop and Vec::with_capacity instead of an Iterator and collect. That way, we get flatter debug info and more readable flamegraphs. It shouldn’t really change any of the performance outcomes, though.
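For reference, the two styles look roughly like this for the simplest `Vec<String>` case. This is a sketch that assumes the input is a slice of `&dyn Display` references; the function names are made up for the example.

```rust
use std::fmt::Display;

/// Iterator-based version: `map` and `collect`.
pub fn collect_tags(tags: &[&dyn Display]) -> Vec<String> {
    tags.iter().map(|tag| tag.to_string()).collect()
}

/// Manual version: pre-sized `Vec` and a plain `for` loop.
pub fn loop_tags(tags: &[&dyn Display]) -> Vec<String> {
    let mut out = Vec::with_capacity(tags.len());
    for tag in tags {
        out.push(tag.to_string());
    }
    out
}
```

Both produce the same result. The loop version simply has fewer nested iterator adapters showing up in stack traces.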
    pub fn smolbuf() {
        run(|tags| -> Option<Box<[Str24]>> {
            if tags.is_empty() {
                return None;
            }
            let mut string_buf = StringBuf::<128>::default();
            let mut collected_tags = Vec::with_capacity(tags.len());
            for tag in tags {
                string_buf.clear();
                write!(&mut string_buf, "{tag}").unwrap();
                collected_tags.push(Str24::new(string_buf.as_str()));
            }
            Some(collected_tags.into_boxed_slice())
        })
    }
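To give an idea of what a reusable, fixed-capacity formatting buffer could look like, here is a minimal sketch. It is an assumption about the general shape, not the actual `StringBuf` used above: the name `StackBuf` is hypothetical, and this version simply reports an error when the text does not fit, whereas a real implementation might fall back to the heap.

```rust
use std::fmt::{self, Write};

/// Hypothetical fixed-capacity string buffer living entirely on the stack.
pub struct StackBuf<const N: usize> {
    len: usize,
    bytes: [u8; N],
}

impl<const N: usize> Default for StackBuf<N> {
    fn default() -> Self {
        Self { len: 0, bytes: [0; N] }
    }
}

impl<const N: usize> StackBuf<N> {
    /// Resets the buffer so it can be reused for the next value.
    pub fn clear(&mut self) {
        self.len = 0;
    }

    pub fn as_str(&self) -> &str {
        // Only whole `&str` fragments are ever appended, so the contents stay valid UTF-8.
        std::str::from_utf8(&self.bytes[..self.len]).expect("buffer holds valid UTF-8")
    }
}

impl<const N: usize> Write for StackBuf<N> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let end = self.len + s.len();
        // Report an error instead of truncating when the text does not fit.
        let dst = self.bytes.get_mut(self.len..end).ok_or(fmt::Error)?;
        dst.copy_from_slice(s.as_bytes());
        self.len = end;
        Ok(())
    }
}
```

Because it implements `std::fmt::Write`, such a buffer can be used with `write!` just like in the benchmark, without touching the allocator while formatting.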
But the biggest difference now is that I have native Linux and macOS machines at hand to do some more benchmarking. Remember, in my previous post, I was running these benchmarks on a Windows machine, as well as within WSL.
And at least on Linux, I’m also able to follow some of the benchmarking best practices, namely pinning the process to a specific core, disabling ASLR, and changing the automatic frequency scaling. This indeed makes the numbers a lot more stable.
Here are the results on Linux, running on a single core:
    divan                   fastest  │ slowest  │ median   │ mean
    ├─ _1_vec_string        463.6 ns │ 34.47 µs │ 493.6 ns │ 508.1 ns
    ├─ _2_boxed_string      456.6 ns │ 31.86 µs │ 483.6 ns │ 509.2 ns
    ├─ _3_boxed_boxed       537.6 ns │ 32.33 µs │ 580.6 ns │ 599.1 ns
    ├─ _4_thread_local      476.6 ns │ 32.23 µs │ 509.6 ns │ 532.6 ns
    ├─ _5_smallvec          489.6 ns │ 33.91 µs │ 519.6 ns │ 543 ns
    ├─ _6_smolstr           482.6 ns │ 34.88 µs │ 494.6 ns │ 527.3 ns
    ├─ _7_smallvec_smolstr  508.6 ns │ 34.04 µs │ 516.6 ns │ 534.7 ns
    ╰─ _8_smolbuf           440.6 ns │ 33.58 µs │ 451.6 ns │ 480 ns
We can see that the thread_local code fares a bit better than last time, and better than the Box<[Box<str>]> code, which re-allocates the String into a Box<str>.
It is still worse than the simplest code returning a Vec<String>, but not by much.
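As a rough illustration of where that re-allocation can come from: a `String` built by formatting usually ends up with more capacity than it needs, and `String::into_boxed_str` discards the excess before converting, which may involve the allocator. The helper below is a hypothetical sketch, not part of the benchmark.

```rust
use std::fmt::{Display, Write};

/// Formats `value` and returns the capacity the `String` had before
/// it was converted, together with the resulting `Box<str>`.
pub fn format_boxed(value: &dyn Display) -> (usize, Box<str>) {
    let mut s = String::new();
    write!(s, "{value}").unwrap();
    let capacity_before = s.capacity();
    // `into_boxed_str` drops excess capacity first, so the allocation
    // may have to be shrunk or moved when `capacity_before > len`.
    (capacity_before, s.into_boxed_str())
}
```

Whether the shrink actually moves the data depends on the allocator, which is one reason the cost can differ between platforms.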
The code using my forked smol_str proves to be quite an improvement over the smallvec / smol_str code it is based on, and it is the fastest among the ones benchmarked.
Running the benchmark with DIVAN_THREADS=4 (this laptop only has a 2C/4T CPU) yields the following results:
    divan                   fastest  │ slowest  │ median   │ mean
    ├─ _1_vec_string        754.6 ns │ 392.8 µs │ 2.659 µs │ 2.594 µs
    ├─ _2_boxed_string      732.6 ns │ 84.01 µs │ 2.673 µs │ 2.534 µs
    ├─ _3_boxed_boxed       916.6 ns │ 698.3 µs │ 3.015 µs │ 2.901 µs
    ├─ _4_thread_local      1.054 µs │ 341.3 µs │ 3.315 µs │ 3.353 µs
    ├─ _5_smallvec          899.6 ns │ 165.7 µs │ 3.001 µs │ 2.89 µs
    ├─ _6_smolstr           810.6 ns │ 378.1 µs │ 2.727 µs │ 2.726 µs
    ├─ _7_smallvec_smolstr  741.6 ns │ 368.7 µs │ 2.813 µs │ 2.798 µs
    ╰─ _8_smolbuf           820.6 ns │ 34.09 µs │ 2.599 µs │ 2.563 µs
This yields a similar picture: the thread_local variant turns out to be the slowest of all, even though it should be especially beneficial in this scenario.
The rewritten smol_str again comes out ahead (looking at the median numbers), but just barely.
And now onto the elephant in the room: macOS.
I’m a bit unsure about how to apply benchmarking best practices here, and I admit that the numbers I got were quite unstable. But here they are, running single-threaded. Again, I am primarily looking at the median.
    divan                   fastest  │ slowest  │ median   │ mean
    ├─ _1_vec_string        1.424 µs │ 136 µs   │ 1.499 µs │ 1.637 µs
    ├─ _2_boxed_string      1.474 µs │ 68.11 µs │ 1.679 µs │ 1.766 µs
    ├─ _3_boxed_boxed       1.608 µs │ 69.31 µs │ 1.901 µs │ 1.984 µs
    ├─ _4_thread_local      1.547 µs │ 99.33 µs │ 1.644 µs │ 1.711 µs
    ├─ _5_smallvec          1.502 µs │ 67.44 µs │ 1.578 µs │ 1.629 µs
    ├─ _6_smolstr           872.7 ns │ 66.57 µs │ 933.7 ns │ 972.4 ns
    ├─ _7_smallvec_smolstr  836.7 ns │ 101.4 µs │ 868.7 ns │ 911.7 ns
    ╰─ _8_smolbuf           785.7 ns │ 114.4 µs │ 815.7 ns │ 853.4 ns
What we can see is that the thread_local version finally turns out to be an improvement compared to the (re)-allocating one.
And all the SSO-based versions are considerably faster than the ones allocating heap memory.
Here are the numbers running with 16 threads:
    divan                   fastest  │ slowest  │ median   │ mean
    ├─ _1_vec_string        1.962 µs │ 131.8 µs │ 3.975 µs │ 4.273 µs
    ├─ _2_boxed_string      1.992 µs │ 69.95 µs │ 4.068 µs │ 4.404 µs
    ├─ _3_boxed_boxed       2.169 µs │ 65.09 µs │ 4.46 µs  │ 4.796 µs
    ├─ _4_thread_local      2.236 µs │ 31.15 µs │ 4.862 µs │ 5.142 µs
    ├─ _5_smallvec          1.882 µs │ 29.93 µs │ 3.885 µs │ 4.129 µs
    ├─ _6_smolstr           1.202 µs │ 22.67 µs │ 2.706 µs │ 2.825 µs
    ├─ _7_smallvec_smolstr  1.161 µs │ 53.01 µs │ 2.617 µs │ 2.776 µs
    ╰─ _8_smolbuf           1.071 µs │ 20.43 µs │ 2.465 µs │ 2.603 µs
The thread_local version is again losing out, but the win of the non-allocating SSO-based versions is very clear.
# Conclusion
It is definitely fun obsessing about these tiny details, and chasing the smallest of improvements.
But the results have also shown that the simplest solution is often also the fastest, surprising as it may be.
However, one important thing to note here is that these were micro-benchmarks. Even the cases that allocate memory only hold on to it for a short period of time, and free it right away. Things might look very different when other parts of the code are allocating heap memory all over the place, and when the resulting allocation is held on to for a longer time and potentially freed on a different thread.
Not to mention that a ton of small (re)-allocations can lead to increased heap fragmentation.
All in all, I believe using smol_str, or my fork / rewrite that I’m supposed to publish at some point, is worth it in general, even though it’s quite an effort to get it to beat the simplest solution in micro-benchmarks.
In the end, one should always look at a macro-benchmark instead, as real-world numbers are what we are ultimately interested in.
Another fun outcome of this whole analysis is that for this micro-benchmark, Linux is way faster than both Windows and macOS.
Not only is WSL twice as fast as the Windows version running on the same hardware, but my aging 2017 i7-7500U has similar single-thread performance to my 2018 Ryzen 2700X, and easily beats the 2019 i9-9880H running macOS.
The low number of hardware cores/threads does show, but the 7-year-old hardware is still enough to do meaningful Rust development on.