Swatinem Blog Resume

The size of Rust Futures

— 10 min

I have recently discovered that Rust Futures, or rather, async fn calls can lead to surprising performance problems if they are nested too deeply.

I learned that the hard way by triggering a stack overflow in a PR to symbolicator. I then tracked that down to unreasonably huge (as measured by mem::size_of_val) futures. The problem was also being exacerbated by the deeply nested async fn calls in the moka crate that I have reported here.

It is not my intention to bash that specific crate here, as I absolutely love its intuitive APIs, and would love to use it even more in the future. However the crate does make some wrong assumptions about how async Rust code works that I am sure not a lot of people are aware of, and which can cause problems.

Apart from highlighting the source of the problem in great depth, I also want to propose some workarounds for this specific issue.

The problem I will highlight is also only present in todays Rust (nightly 1.68). It is perfectly possible that the compiler will optimize these things in the future.


So, lets dive in, and consider this piece of code here:

pub async fn test() {
    let _ = a(()).await;
}

pub async fn a<T>(t: T) -> T {
    b(t).await
}
async fn b<T>(t: T) -> T {
    c(t).await
}
async fn c<T>(t: T) -> T {
    t
}

We have an outer future, which threads a generic argument through a nested call chain.

As the topic of this blog post is size, lets see how large this future is. Instead of calling mem::size_of_val at runtime, I will use the nightly-only -Zprint-type-sizes flag and show you an abbreviated output of that:

type: `[async fn test}`: 4 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
    variant `Suspend0`: 3 bytes
        field `.__awaitee`: 3 bytes, offset: 0 bytes
type: `[async fn a]`: 3 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
        field `.t`: 0 bytes, offset: 0 bytes
    variant `Suspend0`: 2 bytes
        field `.t`: 0 bytes, offset: 0 bytes
        field `.__awaitee`: 2 bytes
type: `[async fn b]`: 2 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
        field `.t`: 0 bytes, offset: 0 bytes
    variant `Suspend0`: 1 bytes
        field `.t`: 0 bytes, offset: 0 bytes
        field `.__awaitee`: 1 bytes
type: `[async fn c]`: 1 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
        field `.t`: 0 bytes, offset: 0 bytes

In this first example, our T is zero-sized. But still our outer future has a 4 bytes size. The reason is that each future is internally a state machine enum with a discriminant. That discriminant in our case is 1 bytes each. Apart from that, all our types have an alignment of 1 bytes which is included in the normal output of -Zprint-type-sizes but I have removed that for brevity.

So far so good, what happens when we put a larger T there? How about [0u8, 1024]?

type: `[async fn test]`: 3076 bytes
    variant `Suspend0`: 3075 bytes
        field `.__awaitee`: 3075 bytes, offset: 0 bytes
type: `[async fn a]`: 3075 bytes
    variant `Suspend0`: 3074 bytes
        field `.t`: 1024 bytes, offset: 0 bytes
        field `.__awaitee`: 2050 bytes
    variant `Unresumed`: 1024 bytes
        field `.t`: 1024 bytes, offset: 0 bytes
type: `[async fn b]`: 2050 bytes
    variant `Suspend0`: 2049 bytes
        field `.t`: 1024 bytes, offset: 0 bytes
        field `.__awaitee`: 1025 bytes
    variant `Unresumed`: 1024 bytes
        field `.t`: 1024 bytes, offset: 0 bytes
type: `[async fn c]`: 1025 bytes
    variant `Unresumed`: 1024 bytes
        field `.t`: 1024 bytes, offset: 0 bytes

I have removed the discriminants in this output.

What we see however is that each of the nested futures has some space reserved for its own copy of T for the case when it is Unresumed. In the Suspend0 case, it has to hold onto the nested future it is polling, but it also retained a copy of its own T. Why exactly is that?

I suspect it is because it has to move that T into the new future somehow.

Remember that Rust futures are lazy by definition. Which mean calling an async fn does nothing except create a new value type that you can move around, Box or Send to other threads.

The data has to be copied around when calling the function a, no?

Lets look at a bit of generated assembly. I will use my trusty ready_or_diverge function there to actually create and also poll the future. I even had to annotate one of my futures with #[inline(never)], because yes, the Rust compiler is very smart indeed and can even make async futures disappear completely in some cases.

You can look at the full example in the Compiler Explorer, but here is the relevant snippet:

example::a:
        push    rbx
        mov     rbx, rdi
        mov     edx, 1024
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     byte ptr [rbx + 3074], 0
        mov     rax, rbx
        pop     rbx
        ret

example::poll_test:
        push    rbx
        mov     eax, 4128
        call    __rust_probestack
        sub     rsp, rax

We can see that our #[inline(never)] fn a has a call to memcpy, so it is copying our data around. Though as I mentioned, the compiler is very good at optimizing things away. Especially in my simplified testcase where I never actually return Pending, and the compiler is smart enough to figure that out and optimize everything away.

However, we also see that my poll_test function reserves 4128 bytes on the stack. That is indeed a local copy of my 1024 byte buffer, the size of the outer test future, 3076 bytes, and then some.

This is my stack overflow right there. One thing to note here is also that __rust_probestack checks for stack overflows explicitly and prints out a helpful error message before aborting the program. Otherwise the stack overflow would only manifest itself as a hard crash when trying to access out of bounds memory. So thank you Rust for that :-)

If I add a Box::pin around the test() future I want to poll to completion, the compiler still reserves 2080 bytes of stack space, which is a little bit more than twice my 1024 bytes buffer. There is no call to __rust_probestack in that case. Likely because there is some threshold at which the compiler inserts that call (probably 4096 bytes?).

That reservation goes away when I remove the #[inline(never)] annotation. But remember, this is a very simple playground example in which the compiler can figure out that my future is polled to completion and is immediately Ready. I believe in real world examples, like the one I hit with my own real world code, the compiler will have to fall back to a lot more memcpy and stack allocations.

Aside: One side-effect of async fn being actually two separate functions, one creating the future, and one for its poll implementation is that annotations like #[inline] currently apply to the outer creates the future function only. There is an open issue about that, and I believe it should be fairly straight forward to make that annotation apply to both functions in this case.


So what have we learned so far?

Rust async fn capture all of their arguments. And capture in this case means copying them into a new value type that represents the underlying future. Depending on compiler optimizations, that involves a lot of memcpy, and possibly also huge stack allocations.

This especially hurts if you are passing huge arguments through multiple layers of function calls. The caller has one copy of the argument inside its own type, and it moves / copies that argument into the callee, effectively doubling the size of the argument type. With a deeply nested call chain, and large arguments, this can quickly cause problems.

The large argument in some cases is an impl Future itself. So this can easily become exponential.


And what can we do about this?

Unfortunately, I do not see a simple one-size-fits-all solution. There is tradeoffs everywhere. Either performance, or code-style.

One obvious solution would be to Box::pin large futures. That is the solution I went for to work around my own real world problem. That incurs a heap allocation though which might have some runtime cost. This is a bit sad in my case as I use it in combination with moka, which as an in-memory cache should optimize my fast-path. And now I have have a heap allocation in all the cases.

Another solution would be to group multiple arguments into a single reference:

pub async fn before_caller<A, B>(a: A, b: B) {
    before_callee(a, b).await
}
async fn before_callee<A, B>(_a: A, _b: B) {}

pub async fn after_caller<A, B>(a: A, b: B) {
    let ab = (a, b);
    after_callee(&ab).await
}
async fn after_callee<A, B>(_ab: &(A, B)) {}

This definitely does not look nice, but it reduces the type that needs to be copied into the future to 8 bytes, the size of a normal references.

But sometimes we want to move and consume things, how to handle those cases? Well, you can use a &mut Option for that and just .take().unwrap(). It is ugly, but works. However it also has a cost. Doing an unwrap has some runtime cost, as well as generating a ton of panic messages.

And it leaves you open to developer error, as accidentally doing that twice will panic. Using if let Some(x) = opt.take() is a panic-free version with a different tradeoff: It can lead to a latent bug that you will never know about. So in this case doing a panic is a good thing, as it reminds you of that bug.

If we are dealing with futures, we can also pass a Pin<&mut impl Future> when we pin the relevant future in the outermost callee. Again, not very nice, and it also makes you vulnerable to polling that future again after it completed, which will panic.

Last but not least, we can combine all of these cases into a stateful wrapper struct, and just use a &mut self. Though this still leaves us with the Option / poll problem, so not a bulletproof solution either.


Well, here we are. Another PSA about some hidden pitfalls with Rust async fn. I hope by explaining all the details here, and even giving some suggestions (though not perfect ones), you can avoid some of these problems in your own code.

Also, it is perfectly possible that the Rust compiler will get smarter over time and could eventually completely remove this problem. Using .await already makes sure that you consume the future. If the compiler can also prove that the future does not cross the async fn boundary, it should be free to not only inline the runtime code, but also the data of a callee into the caller. But not yet. Time will tell.


Update

After searching through the Rust issue tracker, there are multiple open issues related to this. A tracking issue about memory usage, arguments being duplicated across yield points, and inefficient codegen, mostly memcpy related.

Following these issues through other interlinked issues and PRs indeed shows that it is a well known issue, and there have been multiple experiments and proposals so far for how to solve it piece by piece, but not all of those were successful.

All in all, this makes me quite confident that this problem can indeed be solved over time. And maybe even my own work to remove GenFuture, and the identity_future I had to leave in its place, can help with this effort as well.