More fun with SSO, part 2

— 6 min

Recently I took a look at benchmarking various approaches to small string optimization, and you can read up on that here.

As a small recap: I want to turn a &[&dyn Display] into a Vec<String>, or a similar type, as efficiently as possible.
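For reference, the simplest baseline (the _1_vec_string case in the results below) presumably boils down to something like this sketch (collect_tags is a name I made up for illustration):

use std::fmt::Display;

// Simplest baseline: format each tag into its own heap-allocated String.
fn collect_tags(tags: &[&dyn Display]) -> Vec<String> {
    tags.iter().map(|tag| tag.to_string()).collect()
}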

Even before writing that post, I had forked the smol_str crate and rewritten its internals around a hand-crafted binary layout, using way more unsafe code than some people may find acceptable. But as we will see, that squeezes a bit more performance out of it.
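The fork's actual internals are not shown here, but the general idea behind an SSO type like Str24 can be sketched in safe code like this (Repr is a made-up name, and such an enum ends up larger than 24 bytes because of its discriminant, which is exactly what the hand-crafted unsafe layout avoids):

use std::sync::Arc;

// Conceptual sketch of small string optimization: strings of up to 23
// bytes live inline in the struct itself, longer ones fall back to a
// shared heap allocation.
enum Repr {
    Inline { len: u8, buf: [u8; 23] },
    Heap(Arc<str>),
}

impl Repr {
    fn as_str(&self) -> &str {
        match self {
            // Only the first `len` bytes of the inline buffer are valid.
            Repr::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            Repr::Heap(s) => s,
        }
    }
}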

Also, since the thread_local! variant performed way below my expectations, I changed it from using a Cell with take and set to using a RefCell with with_borrow_mut. That avoids a round trip through the thread-local machinery, as well as copying some bytes around.
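In code, the change looks roughly like this (a simplified sketch rather than the actual benchmark code):

use std::cell::{Cell, RefCell};

thread_local! {
    // Before: every use has to `take` the String out of the Cell and
    // `set` it back in afterwards. That is two thread-local accesses,
    // and the (pointer, length, capacity) triple is moved both times.
    static CELL_BUF: Cell<String> = Cell::new(String::new());

    // After: `with_borrow_mut` hands out a &mut String in place,
    // with a single thread-local access and no moves.
    static REFCELL_BUF: RefCell<String> = RefCell::new(String::new());
}

fn len_via_cell() -> usize {
    let mut buf = CELL_BUF.take();
    buf.clear();
    buf.push_str("some tag");
    let len = buf.len();
    CELL_BUF.set(buf);
    len
}

fn len_via_refcell() -> usize {
    REFCELL_BUF.with_borrow_mut(|buf| {
        buf.clear();
        buf.push_str("some tag");
        buf.len()
    })
}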

Here is the benchmark that was added to the existing ones. It now also uses a manual for loop with Vec::with_capacity instead of an Iterator and collect. That way we get flatter debug info and more readable flamegraphs, but it shouldn’t really change any of the performance outcomes.

// `write!` needs the `fmt::Write` trait in scope (assuming `StringBuf`
// implements it); `Str24` and `StringBuf` come from the forked
// `smol_str` crate, and `run` is the shared benchmark harness.
use std::fmt::Write;

pub fn smolbuf() {
    run(|tags| -> Option<Box<[Str24]>> {
        if tags.is_empty() {
            return None;
        }

        // A fixed-capacity scratch buffer, re-used for formatting each tag.
        let mut string_buf = StringBuf::<128>::default();
        let mut collected_tags = Vec::with_capacity(tags.len());
        for tag in tags {
            string_buf.clear();
            write!(&mut string_buf, "{tag}").unwrap();
            collected_tags.push(Str24::new(string_buf.as_str()));
        }
        Some(collected_tags.into_boxed_slice())
    })
}

But the biggest difference now is that I have native Linux and macOS machines with me to do some more benchmarking. Remember, in my previous post I was running these benchmarks on a Windows machine, as well as within WSL.

And at least on Linux, I’m also able to follow some benchmarking best practices: pinning the process to a specific core, disabling ASLR, and disabling automatic CPU frequency scaling. This indeed makes the numbers a lot more stable.
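The exact commands are not part of this post, but on Linux these tweaks typically look something like the following (the binary path is just a placeholder):

# Pin the process to a single core:
taskset -c 0 ./target/release/benchmark

# Disable address space layout randomization for one run:
setarch "$(uname -m)" -R ./target/release/benchmark

# Switch off automatic frequency scaling via the governor:
sudo cpupower frequency-set -g performance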

Here are the results on Linux, running on a single core:

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        463.6 ns      │ 34.47 µs      │ 493.6 ns      │ 508.1 ns
├─ _2_boxed_string      456.6 ns      │ 31.86 µs      │ 483.6 ns      │ 509.2 ns
├─ _3_boxed_boxed       537.6 ns      │ 32.33 µs      │ 580.6 ns      │ 599.1 ns
├─ _4_thread_local      476.6 ns      │ 32.23 µs      │ 509.6 ns      │ 532.6 ns
├─ _5_smallvec          489.6 ns      │ 33.91 µs      │ 519.6 ns      │ 543 ns
├─ _6_smolstr           482.6 ns      │ 34.88 µs      │ 494.6 ns      │ 527.3 ns
├─ _7_smallvec_smolstr  508.6 ns      │ 34.04 µs      │ 516.6 ns      │ 534.7 ns
╰─ _8_smolbuf           440.6 ns      │ 33.58 µs      │ 451.6 ns      │ 480 ns

We can see that the thread_local code fares a bit better than last time, and better than the Box<[Box<str>]> code, which re-allocates the String into a Box<str>. It is still worse than the simplest code returning a Vec<String>, but not by much.

My forked smol_str proves to be quite an improvement over the smallvec / smol_str combination it is based on, and it is the fastest among the variants benchmarked.

Running the benchmark with DIVAN_THREADS=4 (this laptop only has a 2C/4T CPU) yields the following results:

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        754.6 ns      │ 392.8 µs      │ 2.659 µs      │ 2.594 µs
├─ _2_boxed_string      732.6 ns      │ 84.01 µs      │ 2.673 µs      │ 2.534 µs
├─ _3_boxed_boxed       916.6 ns      │ 698.3 µs      │ 3.015 µs      │ 2.901 µs
├─ _4_thread_local      1.054 µs      │ 341.3 µs      │ 3.315 µs      │ 3.353 µs
├─ _5_smallvec          899.6 ns      │ 165.7 µs      │ 3.001 µs      │ 2.89 µs
├─ _6_smolstr           810.6 ns      │ 378.1 µs      │ 2.727 µs      │ 2.726 µs
├─ _7_smallvec_smolstr  741.6 ns      │ 368.7 µs      │ 2.813 µs      │ 2.798 µs
╰─ _8_smolbuf           820.6 ns      │ 34.09 µs      │ 2.599 µs      │ 2.563 µs

This paints a similar picture: the thread_local variant turns out to be quite bad, even though it should be especially beneficial in this scenario. The rewritten smol_str is again ahead (looking at the median numbers), but just barely.


And now onto the elephant in the room: macOS.

I’m a bit unsure about how to apply benchmarking best practices here, and I admit that the numbers I got were quite unstable. But here they are, running single-threaded. Again, I am primarily looking at the median.

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        1.424 µs      │ 136 µs        │ 1.499 µs      │ 1.637 µs
├─ _2_boxed_string      1.474 µs      │ 68.11 µs      │ 1.679 µs      │ 1.766 µs
├─ _3_boxed_boxed       1.608 µs      │ 69.31 µs      │ 1.901 µs      │ 1.984 µs
├─ _4_thread_local      1.547 µs      │ 99.33 µs      │ 1.644 µs      │ 1.711 µs
├─ _5_smallvec          1.502 µs      │ 67.44 µs      │ 1.578 µs      │ 1.629 µs
├─ _6_smolstr           872.7 ns      │ 66.57 µs      │ 933.7 ns      │ 972.4 ns
├─ _7_smallvec_smolstr  836.7 ns      │ 101.4 µs      │ 868.7 ns      │ 911.7 ns
╰─ _8_smolbuf           785.7 ns      │ 114.4 µs      │ 815.7 ns      │ 853.4 ns

What we can see is that, finally, the thread_local version is an improvement over the (re)-allocating one. And all the SSO-based versions are considerably faster than the ones allocating heap memory.

Here are the numbers running with 16 threads:

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        1.962 µs      │ 131.8 µs      │ 3.975 µs      │ 4.273 µs
├─ _2_boxed_string      1.992 µs      │ 69.95 µs      │ 4.068 µs      │ 4.404 µs
├─ _3_boxed_boxed       2.169 µs      │ 65.09 µs      │ 4.46 µs       │ 4.796 µs
├─ _4_thread_local      2.236 µs      │ 31.15 µs      │ 4.862 µs      │ 5.142 µs
├─ _5_smallvec          1.882 µs      │ 29.93 µs      │ 3.885 µs      │ 4.129 µs
├─ _6_smolstr           1.202 µs      │ 22.67 µs      │ 2.706 µs      │ 2.825 µs
├─ _7_smallvec_smolstr  1.161 µs      │ 53.01 µs      │ 2.617 µs      │ 2.776 µs
╰─ _8_smolbuf           1.071 µs      │ 20.43 µs      │ 2.465 µs      │ 2.603 µs

The thread_local version is again losing out, but the win of the non-allocating SSO-based versions is very clear.

# Conclusion

It is definitely fun obsessing about these tiny details, and chasing the smallest of improvements.

But the results have also shown that the simplest solution is often also the fastest, surprising as it may be.

However, one important thing to note here is that these were micro-benchmarks. Even the cases that allocate memory only hold on to it for a short period of time and free it right away. Things might look very different when other parts of the code are allocating heap memory all over the place, when the resulting allocations are held onto for longer, and when they are potentially freed on a different thread.

Not to mention that a ton of small (re)-allocations can lead to increased heap fragmentation.

All in all, I believe using smol_str, or my fork / rewrite that I should publish at some point, is worth it in general, even though it’s quite an effort to get it to beat the simplest solution in micro-benchmarks.

One should always look at macro-benchmarks instead, as it is the real-world numbers that we are interested in at the end of the day.


Another fun outcome of this whole analysis is that, for this micro-benchmark, Linux is way faster than both Windows and macOS.

Not only is WSL twice as fast as the Windows version running on the same hardware, but my aging 2017 i7-7500U has similar single-thread performance to my 2018 Ryzen 2700X, and easily beats the 2019 i9-9880H running macOS.

The low number of hardware cores/threads does show, but the 7-year-old hardware is still good enough to do meaningful Rust development on.