Fun with benchmarking small string optimization

28 April 2024 — 16 min

Sometimes I really like to go down a rabbit hole and obsess about some tiny details.

Today is such a day, and I am benchmarking, profiling and trying to micro-optimize a piece of Rust code, specifically looking at small string optimization.

My use-case is as following: I have a slice of &dyn Display, and I want to stringify and capture those values for later.

To have some comparable test-cases, here is also a small benchmarking function that calls our actual implementation with a bunch of examples:

type InputTags<'a> = &'a [&'a dyn std::fmt::Display];

fn run<F, R>(f: F)
where
    F: Fn(InputTags) -> R,
{
    use std::hint::black_box;

    let tags: InputTags = &[];
    black_box(f(black_box(tags)));

    let tags: InputTags = &[&true];
    black_box(f(black_box(tags)));

    let tags: InputTags = &[&123u32, &456u64];
    black_box(f(black_box(tags)));

    let tags: InputTags = &[&123f32, &b'b', &'c'];
    black_box(f(black_box(tags)));

    let tags: InputTags = &[
        &"some more, ",
        &"and a bit longer",
        &"tag values.",
        &"with one a bit longer than the capacity of a smol_str",
    ];
    black_box(f(black_box(tags)));
}

As you can see, this is also using std::hint::black_box to prevent the compiler from just optimizing everything away.

A straight forward and extremely simple implementation is this is this one-liner:

pub fn vec_string() {
    run(|tags| -> Vec<String> { tags.iter().map(|d| d.to_string()).collect() })
}

My assumption though is that it is suboptimal for a number of reasons I want to improve upon.

Both the Vec and String are growable types.
Both are thus wasting a capacity: usize that is not needed.
In particular the String might have some spare capacity at the end that is not needed.
The String has an indirection to the heap, and an allocation.

Another thing that is not quite obvious from this example is that I want to also optimize the on-stack size of the result. So I am mostly focused on the performance and allocation of the inner string, and not the outer container.

We can get rid of the excess capacity (and the usize) by switching to a boxed slice / str.

Here, my assumptions are this:

Switching to Box<[String]> should be quite cheap, as we know the length / allocation ahead of time.
Switching to Box<str> however will likely result in a re-allocation.
However, we are saving 8 bytes per string, so that should be a positive on the total memory usage.

To validate the first two points, I separated both moves into their own functions:

pub fn boxed_string() {
    run(|tags| -> Option<Box<[String]>> {
        if tags.is_empty() {
            return None;
        }

        let collected_tags: Vec<_> = tags.iter().map(|d| d.to_string()).collect();
        Some(collected_tags.into_boxed_slice())
    })
}

pub fn boxed_boxed() {
    run(|tags| -> Option<Box<[Box<str>]>> {
        if tags.is_empty() {
            return None;
        }

        let collected_tags: Vec<_> = tags
            .iter()
            .map(|d| d.to_string().into_boxed_str())
            .collect();
        Some(collected_tags.into_boxed_slice())
    })
}

As you can see, I actually made these an Option<Box>, as even an empty Box would incur an allocation, and I would like to avoid those. Due to nice optimization, the Option<Box> is still only 16 bytes on-stack.

The into_boxed_str call will most likely result in a re-allocation and copy, plus a subsequent free of the growable String. I have previously seen in profiles that these allocations can be quite hot, and I have also had some success switching to a persistent thread_local allocation.

My assumption there is that relaying on a thread-local is beneficial compared to using the system allocator, in particular in multi-threaded code.

So here is the next step, using a persistent thread_local String:

pub fn thread_local() {
    use std::cell::Cell;

    thread_local! {
        static BUFFER: Cell<String> = const { Cell::new(String::new()) };
    }

    run(|tags| -> Option<Box<[Box<str>]>> {
        if tags.is_empty() {
            return None;
        }

        let mut string_buf = BUFFER.take();

        let collected_tags: Vec<_> = tags
            .iter()
            .map(|d| {
                string_buf.clear();
                write!(&mut string_buf, "{d}").unwrap();
                string_buf.as_str().into()
            })
            .collect();

        BUFFER.set(string_buf);

        Some(collected_tags.into_boxed_slice())
    })
}

I opted to use a Cell, combined with take() here, although a RefCell with with_borrow_mut might be equally good, or even better. In this case, I wanted to avoid having to use yet another closure.

One valid feedback here might be that we are still going to the heap, through a thread_local indirection. Why not just stay on the stack altogether?

Very good point, and one motivating point for this thread_local solution is that it is using only std types. If we want to pull in code from crates.io, we can for example use the smallvec crate to keep a stack-local buffer, and then copy bytes over to the final Box<str> from there.

The assumption is that this should avoid thread_local overhead which unfortunately can be quite significant, as I have written about previously.

Here is a small wrapper around smallvec that acts as a String:

#[derive(Default)]
struct StringBuf<const N: usize> {
    buf: smallvec::SmallVec<u8, N>,
}

impl<const N: usize> StringBuf<N> {
    pub fn as_str(&self) -> &str {
        unsafe { std::str::from_utf8_unchecked(&self.buf) }
    }
    pub fn clear(&mut self) {
        self.buf.clear()
    }
}

impl<const N: usize> Write for StringBuf<N> {
    fn write_str(&mut self, s: &str) -> std::fmt::Result {
        self.buf.extend_from_slice(s.as_bytes());
        Ok(())
    }
}

And here is the final code using that wrapper:

pub fn smallvec() {
    run(|tags| -> Option<Box<[Box<str>]>> {
        if tags.is_empty() {
            return None;
        }

        let mut string_buf = StringBuf::<128>::default();

        let collected_tags: Vec<_> = tags
            .iter()
            .map(|d| {
                string_buf.clear();
                write!(&mut string_buf, "{d}").unwrap();
                string_buf.as_str().into()
            })
            .collect();
        Some(collected_tags.into_boxed_slice())
    })
}

Next up, we are still doing an allocation and a copy for the Box<str>.

Lets get to the meat of this blog post by finally introducing small string optimization. I will use the smol_str crate for this, which also comes with an incredibly convenient format_smolstr! macro:

pub fn smolstr() {
    run(|tags| -> Option<Box<[SmolStr]>> {
        if tags.is_empty() {
            return None;
        }
        let collected_tags: Vec<_> = tags.iter().map(|d| format_smolstr!("{d}")).collect();
        Some(collected_tags.into_boxed_slice())
    })
}

The assumption is that we should see quite an improvement by avoiding all the allocations.

However, the test function has one string that is too large and will spill to the heap. My assumption is that smol_str might not handle that edge-case very well, as I think it is spilling to a String, and then doing yet another re-allocation with the Arc equivalent of into_boxed_str.

Just to validate that case, I built another variant, combining the two:

pub fn smallvec_smolstr() {
    run(|tags| -> Option<Box<[SmolStr]>> {
        if tags.is_empty() {
            return None;
        }

        let mut string_buf = StringBuf::<128>::default();

        let collected_tags: Vec<_> = tags
            .iter()
            .map(|d| {
                string_buf.clear();
                write!(&mut string_buf, "{d}").unwrap();
                string_buf.as_str().into()
            })
            .collect();
        Some(collected_tags.into_boxed_slice())
    })
}

# Benchmarking

So I have my testcase, and I have written a bunch of implementations with some assumptions of how they should behave.

Before we jump into the results, I want to highlight that I did not follow benchmarking best practices, which would be:

Disable automatic frequency scaling, so things are less effected by thermal management.
Disable hyper-threading, and
Pin threads to specific cores to avoid scheduling / context switching overhead.
Keep the system as idle as possible.

Well, ain’t nobody got time fo dat. I have my IDE and browser with music running here in the background. I also can’t be bothered to figure out how to even disable all the CPU features that could lead to unstable benchmark results. And as we will see, the results can get very unstable indeed.

For benchmarking, I am using the divan crate here. While criterion is often said to provide more stable results, I do quite like the convenience of divan, especially being able to trivially run the code multi-threaded. To try to get some more stable results, I also experimented with iai-callgrind.

But lets take a look at the divan results first. Here is a benchmark run on Windows:

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        1.012 µs      │ 1.524 µs      │ 1.037 µs      │ 1.048 µs
├─ _2_boxed_string      993.5 ns      │ 2.343 µs      │ 1.024 µs      │ 1.055 µs
├─ _3_boxed_boxed       1.499 µs      │ 5.112 µs      │ 1.574 µs      │ 1.637 µs
├─ _4_thread_local      1.099 µs      │ 1.862 µs      │ 1.124 µs      │ 1.144 µs
├─ _5_smallvec          1.099 µs      │ 1.637 µs      │ 1.131 µs      │ 1.137 µs
├─ _6_smolstr           781 ns        │ 1.412 µs      │ 793.5 ns      │ 803.2 ns
╰─ _7_smallvec_smolstr  756 ns        │ 999.7 ns      │ 762.2 ns      │ 772 ns

If we look at both the fastest and median time, we can see the following differences:

Going from Vec<String> to Box<[String]> is a wash.
The re-allocation going from String to Box<str> indeed hurts quite a bit.
Using either thread_local or stack-allocated smallvec improves upon that.
However, it seems to be slightly slower than the naive Vec<String> implementation.
Using smol_str is quite a win compared to the baseline.
Surprisingly, using both smallvec and smol_str in another tiny win.

To give an example of the very unstable results I seem to get, here are the results running the exact same code another time:

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        1.018 µs      │ 2.581 µs      │ 2.106 µs      │ 2.048 µs
├─ _2_boxed_string      1.037 µs      │ 2.449 µs      │ 2.112 µs      │ 2.04 µs
├─ _3_boxed_boxed       1.537 µs      │ 6.412 µs      │ 2.537 µs      │ 2.562 µs
├─ _4_thread_local      1.118 µs      │ 2.481 µs      │ 2.143 µs      │ 2.094 µs
├─ _5_smallvec          1.106 µs      │ 17.63 µs      │ 2.134 µs      │ 2.8 µs
├─ _6_smolstr           899.7 ns      │ 10.99 µs      │ 2.699 µs      │ 2.955 µs
╰─ _7_smallvec_smolstr  762.2 ns      │ 4.043 µs      │ 2.743 µs      │ 2.717 µs

For some reason, the median time is around twice it was before across the board, with the median for the stack-allocated code appearing quite a bit slower now. The fastest times however are very much on par with the previous results however.

As I mentioned, one nice thing about divan is that you can run the benchmarks in multiple threads, simply by adding DIVAN_THREADS=16 to your environment.

Now we can see something like:

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        1.699 µs      │ 204.8 µs      │ 4.199 µs      │ 5.569 µs
├─ _2_boxed_string      1.799 µs      │ 70.49 µs      │ 4.199 µs      │ 4.913 µs
├─ _3_boxed_boxed       2.299 µs      │ 89.49 µs      │ 5.099 µs      │ 5.796 µs
├─ _4_thread_local      2.499 µs      │ 110.7 µs      │ 5.899 µs      │ 7.752 µs
├─ _5_smallvec          1.599 µs      │ 60.59 µs      │ 3.199 µs      │ 3.638 µs
├─ _6_smolstr           1.399 µs      │ 82.89 µs      │ 3.749 µs      │ 4.255 µs
╰─ _7_smallvec_smolstr  1.099 µs      │ 20.97 µs      │ 1.949 µs      │ 2.043 µs

Here we can see:

smol_str combined with smallvec is the clear winner here.
Surprisingly, thread_local is quite a pessimization, even running things multi-threaded.

As mentioned previously, I am running 16 threads here on loaded Windows with a Ryzen 2700X CPU which has a 8C/16T configuration.

The thread_local results however were really surprising. I ran quite a bit of benchmarking and profiling on a macOS which showed clear wins. Possibly because the macOS system allocator is particularly slow?

I don’t really want to turn on the MacBook right now as I am on sabbatical right now. But I can try running the same benchmarks on Linux, or rather Windows Subsystem for Linux (WSL) to be exact.

Here are the results from Linux running in WSL on the exact Windows system I used to capture the above results:

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        447.7 ns      │ 3.622 µs      │ 457.7 ns      │ 494.1 ns
├─ _2_boxed_string      452.5 ns      │ 488.5 ns      │ 455 ns        │ 455.3 ns
├─ _3_boxed_boxed       564.5 ns      │ 612.2 ns      │ 567 ns        │ 566.8 ns
├─ _4_thread_local      561.7 ns      │ 3.613 µs      │ 571.7 ns      │ 604.6 ns
├─ _5_smallvec          557.5 ns      │ 3.377 µs      │ 562.2 ns      │ 590.7 ns
├─ _6_smolstr           504.7 ns      │ 2.164 µs      │ 513.7 ns      │ 532.5 ns
╰─ _7_smallvec_smolstr  526.5 ns      │ 586.2 ns      │ 528.7 ns      │ 528.9 ns

There is quite some surprising results here:

Linux seems to be twice as fast as Windows, even running within WSL.
Again, the thread_local variant is slower than just directly allocating.
Most surprisingly, also using smol_str and smallvec is slower as well.

Interestingly, the results are quite more stable than on Windows, possibly because of a better CPU scheduler. I didn’t see wildly different results when I re-ran the same benchmark.

Running the benchmarks multi-threaded via DIVAN_THREADS=16 on Linux gives these results:

divan                   fastest       │ slowest       │ median        │ mean
├─ _1_vec_string        1.533 µs      │ 118.4 µs      │ 2.171 µs      │ 2.344 µs
├─ _2_boxed_string      1.69 µs       │ 37.37 µs      │ 2.217 µs      │ 2.311 µs
├─ _3_boxed_boxed       997.7 ns      │ 36.98 µs      │ 2.383 µs      │ 2.476 µs
├─ _4_thread_local      1.829 µs      │ 146.5 µs      │ 3.234 µs      │ 3.394 µs
├─ _5_smallvec          1.431 µs      │ 97.89 µs      │ 2.346 µs      │ 2.648 µs
├─ _6_smolstr           1.718 µs      │ 31.45 µs      │ 2.707 µs      │ 2.932 µs
╰─ _7_smallvec_smolstr  1.287 µs      │ 46.39 µs      │ 2.4 µs        │ 2.737 µs

These results are similar to the ones above:

The simplest implementation seems to be the fastest.
thread_local does not seem to be an improvement over just allocating directly.
smol_str and smallvec are not providing any gains here.

# Callgrind and Profiling

I mentioned that I was also experimenting with iai-callgrind, so lets look at those results.

callgrind::bench_group::_1_vec_string
  Instructions:                5272|5272            (No change)
  L1 Hits:                     7421|7421            (No change)
  L2 Hits:                        8|8               (No change)
  RAM Hits:                     122|122             (No change)
  Total read+write:            7551|7551            (No change)
  Estimated Cycles:           11731|11731           (No change)
callgrind::bench_group::_2_boxed_string
  Instructions:                5326|5326            (No change)
  L1 Hits:                     7492|7492            (No change)
  L2 Hits:                        8|8               (No change)
  RAM Hits:                     126|126             (No change)
  Total read+write:            7626|7626            (No change)
  Estimated Cycles:           11942|11942           (No change)
callgrind::bench_group::_3_boxed_boxed
  Instructions:                6380|6380            (No change)
  L1 Hits:                     8887|8887            (No change)
  L2 Hits:                       12|12              (No change)
  RAM Hits:                     131|131             (No change)
  Total read+write:            9030|9030            (No change)
  Estimated Cycles:           13532|13532           (No change)
callgrind::bench_group::_4_thread_local
  Instructions:                7065|7065            (No change)
  L1 Hits:                     9998|9998            (No change)
  L2 Hits:                       20|20              (No change)
  RAM Hits:                     139|139             (No change)
  Total read+write:           10157|10157           (No change)
  Estimated Cycles:           14963|14963           (No change)
callgrind::bench_group::_5_smallvec
  Instructions:                5851|5851            (No change)
  L1 Hits:                     8236|8236            (No change)
  L2 Hits:                       12|12              (No change)
  RAM Hits:                     133|133             (No change)
  Total read+write:            8381|8381            (No change)
  Estimated Cycles:           12951|12951           (No change)
callgrind::bench_group::_6_smolstr
  Instructions:                5035|5035            (No change)
  L1 Hits:                     7226|7226            (No change)
  L2 Hits:                       11|11              (No change)
  RAM Hits:                     138|138             (No change)
  Total read+write:            7375|7375            (No change)
  Estimated Cycles:           12111|12111           (No change)
callgrind::bench_group::_7_smallvec_smolstr
  Instructions:                5201|5201            (No change)
  L1 Hits:                     7373|7373            (No change)
  L2 Hits:                       12|12              (No change)
  RAM Hits:                     133|133             (No change)
  Total read+write:            7518|7518            (No change)
  Estimated Cycles:           12088|12088           (No change)

This is quite a lot to take in. The first thing you see in (No change). Indeed the results are stable. Re-running the benchmark yields the exact same numbers.

However, what are we even looking at? Wall time numbers are very obvious, and its also very obvious that lower is better. But which numbers are we trying to improve here exactly? Are we aiming for Instructions, or rather Estimated Cycles, which itself is derived from the L1, L2 and RAM numbers.

It is hard to say what exactly we are optimizing for. But we can see that the smol_str variant is winning when looking for the lowest Instructions. The simplest implementation also has the lowest Estimated Cycles, with the smol_str / smallvec solution being the runner-up.

Last but not least, lets do some profiling as well. I’m a huge fan of samply.

When doing profiling, you should keep in mind that you pulling a Heisenberg here, as you are changing the outcome by measuring it.

Although the overhead of profiling is negligibly small, and did not make any difference in my particular case here.

Here is a flamegraph for the simplest implementation using Vec<String>:

As expected, we see a lot of alloc nested within the fmt call. A bit surprising here is also the high cost of drop-ing all the allocated data.

Compare that with the flamegraph of the smol_str and smallvec solution:

In this flamegraph we can now clearly see the fmt related code, separate from the smol_str code, again separate from the allocation of the outer Box.

However, there is still some surprises hidden in this flamegraph:

SmolStr::new is surprisingly expensive, taking as much time as fmt-ing into the stack-allocated smallvec.
I see a big core::ptr::write there, which I assume is copying the SmolStr to the Vec.
The Iterator::collect abstraction leads to a very deep and complex call graph, though I assume all of that to be inlined.

# Conclusion

As I mentioned in the intro, I really enjoy obsessing over these tiny details. It was fun trying to micro-optimize this particular use-case, although one of the main outcomes of it might as well be:

There is no need to micro-optimize things this hard, as the simplest solution is already good enough in most cases. Some optimizations that I assumed would be beneficial even turned out not to be.

Another thing to note here though is that this is a micro benchmark. It was already hard to get halfway stable results. Not to mention that micro benchmarks can also be sensitive to very minute details like code alignment.

I definitely like macro benchmarks and profiling, aka looking at flamegraphs a lot more. It is much easier to see both opportunities as well as improvements that way.

Speaking of opportunities, SmolStr::new looks much slower than it needs to be, so I might try to optimize it further. The core::ptr::write I mentioned is also interesting, as I would have assumed that the compiler is smart enough to write the SmolStr directly into the allocation.

I believe the nightly-only new_uninit feature might help there. At the very least, that feature would allow returning a Arc<[T]> instead of a Box<[T]> without having to re-allocate. Converting from a Vec<T> to a Box<[T]> is pretty much free as long as the Vec is allocated with the final size known ahead of time. That is not possible for Arc, as the reference counts have to live somewhere as well. Arc::new_uninit_slice would solve that, as I believe its pretty much the same as Vec::with_capacity. Except it requires a bit of unsafe code to use.