The size of Rust Futures
I have recently discovered that Rust Futures, or rather, async fn calls can lead to surprising performance problems if they are nested too deeply. I learned that the hard way by triggering a stack overflow in a PR to symbolicator. I then tracked that down to unreasonably huge (as measured by mem::size_of_val) futures. The problem was also exacerbated by the deeply nested async fn calls in the moka crate, which I have reported here.
It is not my intention to bash that specific crate here, as I absolutely love its intuitive APIs, and would love to use it even more in the future. However, the crate does make some wrong assumptions about how async Rust code works that I am sure not many people are aware of, and which can cause problems.
Apart from highlighting the source of the problem in great depth, I also want to propose some workarounds for this specific issue.
The problem I will highlight is also only present in today's Rust (nightly 1.68). It is perfectly possible that the compiler will optimize these things in the future.
So, let's dive in and consider this piece of code:
pub async fn test() {
    let _ = a(()).await;
}

pub async fn a<T>(t: T) -> T {
    b(t).await
}

async fn b<T>(t: T) -> T {
    c(t).await
}

async fn c<T>(t: T) -> T {
    t
}
We have an outer future, which threads a generic argument through a nested call chain.
As the topic of this blog post is size, let's see how large this future is.
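One way to do that is a runtime measurement, for example like this (a minimal sketch):

fn main() {
    // Calling test() only creates the future without running it, so we
    // can measure the size of the whole state machine, nested awaitees
    // included.
    let fut = test();
    println!("{}", std::mem::size_of_val(&fut)); // prints 4
}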
Instead of calling mem::size_of_val at runtime, I will use the nightly-only -Zprint-type-sizes flag and show you an abbreviated output of that:
type: `[async fn test]`: 4 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
    variant `Suspend0`: 3 bytes
        field `.__awaitee`: 3 bytes, offset: 0 bytes

type: `[async fn a]`: 3 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
        field `.t`: 0 bytes, offset: 0 bytes
    variant `Suspend0`: 2 bytes
        field `.t`: 0 bytes, offset: 0 bytes
        field `.__awaitee`: 2 bytes

type: `[async fn b]`: 2 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
        field `.t`: 0 bytes, offset: 0 bytes
    variant `Suspend0`: 1 bytes
        field `.t`: 0 bytes, offset: 0 bytes
        field `.__awaitee`: 1 bytes

type: `[async fn c]`: 1 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
        field `.t`: 0 bytes, offset: 0 bytes
In this first example, our T is zero-sized. But still, our outer future has a size of 4 bytes. The reason is that each future is internally a state-machine enum with a discriminant. That discriminant in our case is 1 byte each. Apart from that, all our types have an alignment of 1 byte, which is included in the normal output of -Zprint-type-sizes, but I have removed that for brevity.
So far, so good. What happens when we put a larger T there? How about [0u8; 1024]?
type: `[async fn test]`: 3076 bytes
    variant `Suspend0`: 3075 bytes
        field `.__awaitee`: 3075 bytes, offset: 0 bytes

type: `[async fn a]`: 3075 bytes
    variant `Suspend0`: 3074 bytes
        field `.t`: 1024 bytes, offset: 0 bytes
        field `.__awaitee`: 2050 bytes
    variant `Unresumed`: 1024 bytes
        field `.t`: 1024 bytes, offset: 0 bytes

type: `[async fn b]`: 2050 bytes
    variant `Suspend0`: 2049 bytes
        field `.t`: 1024 bytes, offset: 0 bytes
        field `.__awaitee`: 1025 bytes
    variant `Unresumed`: 1024 bytes
        field `.t`: 1024 bytes, offset: 0 bytes

type: `[async fn c]`: 1025 bytes
    variant `Unresumed`: 1024 bytes
        field `.t`: 1024 bytes, offset: 0 bytes
I have removed the discriminants in this output.
What we see, however, is that each of the nested futures has some space reserved for its own copy of T for the case when it is Unresumed. In the Suspend0 case, it has to hold onto the nested future it is polling, but it also retains a copy of its own T. Why exactly is that?
I suspect it is because it has to move that T into the new future somehow.
Remember that Rust futures are lazy by definition, which means calling an async fn does nothing except create a new value type that you can move around, Box, or Send to other threads.
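For example (a small sketch, reusing the a function from above):

fn main() {
    // Calling the async fn only constructs the state-machine value;
    // none of the body has run yet. The 1024-byte argument is moved
    // (copied) into that value.
    let fut = a([0u8; 1024]);
    // The future is a plain value: it can be moved, boxed, or sent to
    // another thread before it is ever polled.
    let _boxed = Box::pin(fut);
}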
The data has to be copied around when calling the function a, no?
Let's look at a bit of generated assembly. I will use my trusty ready_or_diverge function there to actually create and also poll the future. I even had to annotate one of my futures with #[inline(never)], because yes, the Rust compiler is very smart indeed and can even make async futures disappear completely in some cases.
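For reference, ready_or_diverge could look roughly like this (a sketch of my own; the helper in the linked example may differ):

use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, Waker};

// Polls the future once with a no-op waker and returns its output if it
// is immediately Ready, otherwise diverges. Waker::noop() is available
// in recent Rust; older code would hand-roll a RawWaker instead.
fn ready_or_diverge<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let mut cx = Context::from_waker(Waker::noop());
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(output) => output,
        Poll::Pending => loop {
            std::hint::spin_loop();
        },
    }
}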
You can look at the full example in the Compiler Explorer, but here is the relevant snippet:
example::a:
        push    rbx
        mov     rbx, rdi
        mov     edx, 1024
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     byte ptr [rbx + 3074], 0
        mov     rax, rbx
        pop     rbx
        ret

example::poll_test:
        push    rbx
        mov     eax, 4128
        call    __rust_probestack
        sub     rsp, rax
We can see that our #[inline(never)] fn a has a call to memcpy, so it is copying our data around.
Though, as I mentioned, the compiler is very good at optimizing things away, especially in my simplified testcase where I never actually return Pending; the compiler is smart enough to figure that out and optimize everything away.
However, we also see that my poll_test function reserves 4128 bytes on the stack. That is indeed a local copy of my 1024-byte buffer, plus the size of the outer test future (3076 bytes), and then some.
This is my stack overflow right there. One thing to note here is that __rust_probestack checks for stack overflows explicitly and prints a helpful error message before aborting the program. Otherwise, the stack overflow would only manifest as a hard crash when trying to access out-of-bounds memory. So thank you, Rust, for that :-)
If I add a Box::pin around the test() future I want to poll to completion, the compiler still reserves 2080 bytes of stack space, which is a little more than twice my 1024-byte buffer. There is no call to __rust_probestack in that case, likely because there is some threshold at which the compiler inserts that call (probably 4096 bytes?).
That reservation goes away when I remove the #[inline(never)] annotation. But remember, this is a very simple playground example in which the compiler can figure out that my future is polled to completion and is immediately Ready. I believe that in real-world examples, like the one I hit with my own code, the compiler will have to fall back to a lot more memcpy and stack allocations.
Aside: One side-effect of an async fn actually being two separate functions, one that creates the future and one for its poll implementation, is that annotations like #[inline] currently apply only to the outer, future-creating function. There is an open issue about that, and I believe it should be fairly straightforward to make that annotation apply to both functions in this case.
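To illustrate (a conceptual sketch, not the compiler's exact desugaring):

use std::future::Future;

// An async fn behaves like a plain fn that returns a future. An
// attribute like #[inline(never)] applies to this outer function...
#[inline(never)]
pub fn a_desugared<T>(t: T) -> impl Future<Output = T> {
    // ...but not to the generated poll() of the returned state machine.
    async move { b(t).await }
}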
So what have we learned so far?
Rust async fns capture all of their arguments. And capture in this case means copying them into a new value type that represents the underlying future. Depending on compiler optimizations, that involves a lot of memcpy, and possibly also huge stack allocations.
This especially hurts if you are passing huge arguments through multiple layers of function calls. The caller has one copy of the argument inside its own type, and it moves / copies that argument into the callee, effectively doubling the size of the argument type. With a deeply nested call chain, and large arguments, this can quickly cause problems.
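With the numbers from the -Zprint-type-sizes output above, the per-layer arithmetic looks like this:

size(c)    = 1024 (t) +    0 (no awaitee)   + 1 (discriminant) = 1025 bytes
size(b)    = 1024 (t) + 1025 (awaitee of c) + 1 (discriminant) = 2050 bytes
size(a)    = 1024 (t) + 2050 (awaitee of b) + 1 (discriminant) = 3075 bytes
size(test) =            3075 (awaitee of a) + 1 (discriminant) = 3076 bytes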
The large argument in some cases is an impl Future itself, so this can easily become exponential.
And what can we do about this?
Unfortunately, I do not see a simple one-size-fits-all solution. There are tradeoffs everywhere: either performance or code style.
One obvious solution would be to Box::pin large futures. That is the solution I went for to work around my own real-world problem. It incurs a heap allocation though, which might have some runtime cost. This is a bit sad in my case, as I use it in combination with moka, which as an in-memory cache should optimize my fast path. And now I have a heap allocation in all cases.
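Applied to the example from the beginning, that could look like this (a sketch):

pub async fn test_boxed() {
    // Box::pin moves the ~3 KiB inner state machine to the heap; the
    // outer future then only embeds the pointer-sized Box instead of
    // the whole nested future.
    let _ = Box::pin(a([0u8; 1024])).await;
}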
Another solution would be to group multiple arguments into a single reference:
pub async fn before_caller<A, B>(a: A, b: B) {
    before_callee(a, b).await
}

async fn before_callee<A, B>(_a: A, _b: B) {}

pub async fn after_caller<A, B>(a: A, b: B) {
    let ab = (a, b);
    after_callee(&ab).await
}

async fn after_callee<A, B>(_ab: &(A, B)) {}
This definitely does not look nice, but it reduces the type that needs to be copied into the future to 8 bytes, the size of a normal reference.
But sometimes we want to move and consume things; how do we handle those cases? Well, you can use a &mut Option for that and just .take().unwrap(). It is ugly, but it works. However, it also has a cost: doing an unwrap has some runtime cost, as well as generating a ton of panic messages.
And it leaves you open to developer error, as accidentally doing that twice will panic. Using if let Some(x) = opt.take() is a panic-free version with a different tradeoff: it can lead to a latent bug that you will never know about. So in this case, doing a panic is a good thing, as it reminds you of that bug.
If we are dealing with futures, we can also pass a Pin<&mut impl Future> down the call chain when we pin the relevant future in the outermost caller. Again, not very nice, and it also makes you vulnerable to polling that future again after it has completed, which will panic.
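Here is a sketch of that as well:

use std::future::Future;
use std::pin::{pin, Pin};

pub async fn pin_caller(fut: impl Future<Output = ()>) {
    // Pin the large future once, in the outermost frame...
    let fut = pin!(fut);
    // ...and pass only the 8-byte Pin<&mut _> down the call chain.
    pin_callee(fut).await;
}

async fn pin_callee(fut: Pin<&mut impl Future<Output = ()>>) {
    // Pin<&mut F> is itself a Future, so it can be awaited here.
    // Polling it again after completion would panic.
    fut.await;
}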
Last but not least, we can combine all of these approaches into a stateful wrapper struct and just pass a &mut self around. Though this still leaves us with the Option / poll problem, so it is not a bulletproof solution either.
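For example (a sketch):

struct Worker {
    // All the large state lives in the struct itself...
    buf: [u8; 1024],
    // ...including an Option for values that get consumed along the way.
    input: Option<Vec<u8>>,
}

impl Worker {
    // Each nested future only captures the 8-byte &mut self.
    async fn outer(&mut self) {
        self.inner().await
    }

    async fn inner(&mut self) {
        // This still has the take()/unwrap() problem from above.
        let _input = self.input.take().unwrap();
        self.buf[0] = 1;
    }
}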
Well, here we are, another PSA about some hidden pitfalls with Rust async fn. I hope that by explaining all the details here, and even giving some suggestions (though not perfect ones), you can avoid some of these problems in your own code.
Also, it is perfectly possible that the Rust compiler will get smarter over time and eventually remove this problem completely. Using .await already makes sure that you consume the future. If the compiler can also prove that the future does not cross the async fn boundary, it should be free to inline not only the runtime code, but also the data of a callee into the caller. But not yet. Time will tell.
Update
After searching through the Rust issue tracker, I found multiple open issues related to this: a tracking issue about memory usage, arguments being duplicated across yield points, and inefficient codegen, mostly related to memcpy.
Following these issues through other interlinked issues and PRs indeed shows that this is a well-known issue, and there have been multiple experiments and proposals so far for how to solve it piece by piece, but not all of those were successful.
All in all, this makes me quite confident that this problem can indeed be solved over time. And maybe even my own work to remove GenFuture, and the identity_future I had to leave in its place, can help with this effort as well.