Please just `panic!`
Error handling in Rust is a complex topic. And it is not the good kind of inherent complexity, which requires a sophisticated but also elegant solution.
Rather, it is complex because there is no right solution. It is also annoying and ugly most of the time.
Developers familiar with Rust will know its two or three ways of handling errors, depending on how you count.
There is the `Result`-based handling, with detailed error `enum`s.
Then there is the still `Result`-based, but more opaque `anyhow::Error`.
And then there is the rather crude `panic!`, which I will argue might be the right thing to use in some cases.
You will always hear the recommendation to use `thiserror`, aka detailed error `enum`s, for libraries,
and `anyhow`, aka opaque catch-all errors, for binaries.
I would go a bit further and say that you should only use detailed error `enum`s for foundational building blocks,
in situations where you would realistically expect users to match on those enums.
And let me tell you, I can’t really remember the last place where I matched on any error except for `io::ErrorKind::NotFound`.
It’s also reasonable to argue whether `NotFound` should actually be an error, or rather just an `Option`.
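To make the "`NotFound` could just be an `Option`" idea concrete, here is a minimal sketch using only the standard library. The helper name `read_optional` is made up for illustration; it is not an existing API.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads a file, treating "file does not exist" as `None` instead of an error.
/// `read_optional` is a hypothetical helper name, not a std API.
fn read_optional(path: &Path) -> io::Result<Option<String>> {
    match fs::read_to_string(path) {
        Ok(contents) => Ok(Some(contents)),
        // The one error kind worth matching on: absence is an expected outcome.
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(None),
        // Everything else (permissions, broken disk, ...) is simply propagated.
        Err(err) => Err(err),
    }
}
```

A caller can then write `if let Some(config) = read_optional(path)? { ... }` and never has to look at an `ErrorKind` again.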
I think this is the only legitimate error that is meaningful to end users. All other errors are the result of bugs, infrastructure problems, or malicious attacks. Not something that has an obvious solution, either on the developer side, or the end user side.
Having overly detailed error types and error handling code just adds incidental complexity, and sometimes also has a real performance impact.
I have seen clippy raise its `large_enum_variant` lint because of `Result<(), SomeHugeErrorType>`,
where the error type was multiple hundreds of bytes.
Please stop optimizing for the 0.01% case.
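To see why a huge error type hurts even when nothing goes wrong, here is a small sketch comparing the size of a `Result` with an inline error against one with a boxed error. `HugeError` and its 256-byte payload are made up for illustration; boxing is just one common mitigation, not the only one.

```rust
use std::mem::size_of;

/// Hypothetical oversized error type: 256 bytes of inline payload.
#[allow(dead_code)]
struct HugeError {
    details: [u8; 256],
}

/// Size of a `Result` that carries the error inline.
/// Every return value pays for the full error, even on the happy path.
fn inline_result_size() -> usize {
    size_of::<Result<(), HugeError>>()
}

/// Size of a `Result` that carries the error behind a `Box`.
/// Only the failing path pays, with one heap allocation.
fn boxed_result_size() -> usize {
    size_of::<Result<(), Box<HugeError>>>()
}
```

The inline variant has to be at least as large as the error itself, while the boxed variant shrinks to roughly the size of a pointer.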
Which brings me back to the topic.
Rust for a long time has been advertised as a systems language, though the wording has changed to foundational software more recently.
Rust puts a huge emphasis on memory safety, and thus also memory management. But in a good way, making memory management easy, and removing footguns.
In this process, Rust has made a choice to optimize for the 99.999% case by making allocation infallible.
Or rather, by recognizing that `malloc` will most likely never fail. And if it really does, in the most unlikely of cases, there is probably nothing we can do about it, so we just throw up our hands and `panic!`.
This has removed so much complexity, noise and pain for developers, and liberated us, so we can focus on the happy path instead.
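To get a feeling for the noise that infallible allocation saves us, here is a rough sketch of the same function written in both styles. The function names are made up; `Vec::try_reserve` and `TryReserveError` are the standard library's opt-in fallible allocation API.

```rust
use std::collections::TryReserveError;

/// Infallible style: allocation failure is not part of the signature at all.
fn collect_infallible(items: &[u32]) -> Vec<u32> {
    let mut out: Vec<u32> = Vec::new();
    for &item in items {
        out.push(item); // may allocate; failure is not something we handle here
    }
    out
}

/// Fallible style: the allocation error leaks into the signature,
/// and every caller now has to decide what to do with it.
fn collect_fallible(items: &[u32]) -> Result<Vec<u32>, TryReserveError> {
    let mut out: Vec<u32> = Vec::new();
    out.try_reserve(items.len())?;
    for &item in items {
        out.push(item); // capacity was reserved above, so this does not reallocate
    }
    Ok(out)
}
```

Now imagine the second signature on every function that ever touches a `Vec`, a `String` or a `Box`.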
I would argue that we should focus on the happy path more, and `panic!` in a lot more cases.
The `?` operator makes error propagation trivial and is also less straining on the eyes than seeing `.unwrap()` or `.expect("...")` everywhere.
On the other hand, actually matching and properly handling errors is often very noisy, and distracts from the actual code.
Error propagation via the `?` operator is also usually very efficient. Except maybe for cases where the error type is hundreds of bytes.
A `panic!` is more heavyweight. But you know what? This is actually a good thing, in the sense that most of the overhead comes from generating and displaying a proper stack trace.
Such a stack trace makes it possible to diagnose and thus fix an error in the first place.
`anyhow` strikes a really good balance here. It can be propagated via `?`, it is a lightweight 8 bytes (a single pointer on 64-bit targets),
and it can carry a stack trace and chained error context.
Though `anyhow` still has the problem that it attaches the stack trace where an error is converted to `anyhow`,
and not where it actually originates. This can make fixing a bug a bit more difficult.
To highlight the importance of a stack trace, let me give you a real-life example that I stumbled across recently.
I was playing around with openraft, which is generic over a log/storage implementation.
Fortunately, `openraft` comes with a conformance test suite for storage implementations.
Here is a small excerpt from the `openraft::testing::Suite::test_store` method:
run_fut(run_test(builder, Self::get_membership_from_log_gt_sm_last_applied_2))?;
run_fut(run_test(builder, Self::get_initial_state_without_init))?;
// ...
run_fut(run_test(builder, Self::save_vote))?;
run_fut(run_test(builder, Self::get_log_entries))?;
run_fut(run_test(builder, Self::limited_get_log_entries))?;
Now, because this is using normal lightweight errors, and `?` for propagation,
there is no stack trace attached to any error that occurs.
So when I run this test suite and it fails, I get an error which doesn’t really mean anything to me.
I also can’t see which specific test case caused the failure, because the `?` propagation
does not preserve the call stack information.
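The effect is easy to reproduce in isolation. The following is a simplified sketch, not the actual `openraft` code: `StoreError`, `run_step` and both suite functions are made up, and the step labels merely borrow names from the excerpt above.

```rust
/// Hypothetical lightweight error, standing in for a storage error.
#[derive(Debug, PartialEq)]
struct StoreError;

/// Hypothetical test step that fails when `step == failing`.
fn run_step(step: u32, failing: u32) -> Result<(), StoreError> {
    if step == failing {
        Err(StoreError)
    } else {
        Ok(())
    }
}

/// Plain `?` propagation: the caller only learns *that* a step failed.
fn run_suite(failing: u32) -> Result<(), StoreError> {
    run_step(1, failing)?;
    run_step(2, failing)?;
    run_step(3, failing)?;
    Ok(())
}

/// The same suite, with each step manually labelled before propagating.
fn run_suite_labelled(failing: u32) -> Result<(), (&'static str, StoreError)> {
    run_step(1, failing).map_err(|e| ("save_vote", e))?;
    run_step(2, failing).map_err(|e| ("get_log_entries", e))?;
    run_step(3, failing).map_err(|e| ("limited_get_log_entries", e))?;
    Ok(())
}
```

With `run_suite`, a failure in the first step is indistinguishable from a failure in the last one. `run_suite_labelled` recovers that information, but only by hand-writing what a stack trace would give you for free.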
So I had to resort to an interactive debugger, stepping over the individual test cases until I found the one that was returning early from this method.
Then I had to restart the debugging session after setting a breakpoint before that test, and step through some async boilerplate until I finally found the code that was actually causing the error.
An interactive debugger has the option to break directly at a `panic!`, which would instantly point you to the failing code.
But okay, maybe the problem is just me, and that I haven’t yet learned how to use `rr`, or even any kind of interactive debugger, effectively.
Though the point still stands that a `panic!` would have helped me fix my problem faster.
This example is a bit special, as this is part of a test suite, and not production code.
So should you just `panic!` in production code then? Well, maybe not quite.
Though if you have some isolation around the thread that can `panic!`, and it won’t bring down your whole service, maybe it’s actually a reasonable thing to do.
For example, the `tokio` runtime catches task `panic!`s, and most web frameworks that spawn a task per request handle those just fine.
And you probably won’t be able to show a better error message with detailed error enums anyway. Or you just shouldn’t even try, as it would just expose internal details that you do not want to leak.
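The isolation idea can be sketched with nothing but the standard library. This uses a plain OS thread per request as a stand-in for a `tokio` task, and `handle_request` and `serve` are made-up names, so treat it as an illustration of the principle rather than how any particular web framework does it.

```rust
use std::thread;

/// Hypothetical request handler that hits a bug for one particular input.
fn handle_request(id: u32) -> String {
    if id == 13 {
        panic!("invariant violated while handling request {}", id);
    }
    format!("response for request {}", id)
}

/// Runs the handler on its own thread and turns a panic into a generic error.
/// The panic message and location go to stderr (or your logs), not to the client.
fn serve(id: u32) -> Result<String, &'static str> {
    match thread::spawn(move || handle_request(id)).join() {
        Ok(body) => Ok(body),
        // `join` returns `Err` with the panic payload if the thread panicked.
        Err(_panic_payload) => Err("500 Internal Server Error"),
    }
}
```

The request that triggers the bug gets an opaque error, every other request keeps working, and the panic output tells you exactly where to look. Note that this relies on the default `panic = "unwind"` setting; with `panic = "abort"` the whole process goes down.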
Well, that’s it for today. Try to keep it simple, and don’t over-optimize for the 0.01% case, please.