
What makes a good test?

— 11 min

I have recently had a couple of discussions around testing philosophy and what makes a good test, and I want to dive deeper into the topic.

# Background

Let me give a bit of background first.

I have experienced quite a bit of the software engineering landscape over the course of my career. I went through a bunch of languages in that time, and worked across frontend, backend, libraries and developer tools.

Through all of that, I have been kind of obsessed with clean code, quality and performance, and testing is a big part of that. But I also take a very pragmatic approach, which in part means not overdoing things, and using the right tools for the job.

Quite recently, I have also made the switch from Sentry’s processing team to codecov, which was acquired by Sentry a while back. Folks who have been reading my blog for a while might know that I also have a passion for code coverage, which I consider a big part of code quality in general. I have done some work related to code coverage in the Rust ecosystem, and I introduced codecov to some of the Sentry Rust projects long before the acquisition.

Fun fact: you can also read this blog on cov.rs. I bought that domain a while ago, and it has been redirecting to this blog ever since, as I haven’t had enough motivation to use it for anything else yet.


And codecov does take code coverage very seriously and has very high coverage for its own code, though I have been criticizing its test suite as not being particularly good.

So this naturally brings up the question: What is a good test?

# It is okay not to test

Or rather: it is okay not to have an automated test suite.

Well obviously we want the software we write to do its intended job, and we verify that by testing. But it is perfectly okay to do that manually, just by running the software and verifying that it is doing its job.

Most of the time the reason for not doing that is that the effort is too high, and simply not worth it. Let me give you an example here.

I maintain the popular rust-cache GitHub Action. It does not have any automated test suite. It does however have a bunch of example workflows that serve a dual purpose as tests. So what are the reasons for not having an automated test suite?

It depends way too heavily on an external service:

Being a GitHub Action, it relies way too heavily on the whole infrastructure around GHA and its supporting services. If I wanted to fully automatically test it, I would have to replicate GHA as faithfully as possible locally. To be quite honest, I wouldn’t even know where to start. The effort to do this is clearly very high, and not worth it for me.

It is just a thin wrapper around @actions/cache:

My action is just a specialized version of GitHub’s own cache action, and shares the underlying code with it. I can just piggy-back on the assumption that it is somewhat well tested.

Although I have said so in the past, I will repeat myself here: I think GHA, and in particular actions/toolkit, are casually maintained abandonware, and I am doubtful of their quality in general.

The stakes are low:

By being a cache, it by definition does not provide any guarantees. A cache speeds up an expensive (idempotent) operation by reusing a previous result. If anything goes wrong with it, you just run the expensive operation and everything is fine.

On that note, I get way too many support requests for that action which I simply can’t answer, because I would either have to delegate them to the upstream code I use, or to the way GitHub Actions fundamentally works, or remind folks that there is no such thing as a guaranteed cache hit.


Long story short, the point I’m trying to make is that testing, like everything else in software engineering, is about tradeoffs. It is perfectly okay to resort to manual testing if the effort of writing automated tests is too high.

# What is the purpose of a test?

But let’s get back to the fundamentals and ask ourselves what the purpose of a test, or of testing in general, is.

A test should, roughly speaking, verify that a piece of software does what it’s supposed to do. In other words, it should be correct.

If we turn that around a bit, we can also say that a test failure should be indicative of a bug in the software. This also closes the loop to the topic of mutation testing, whereby you intentionally introduce bugs and expect the test suite to fail. If it does not, that means the test suite is incomplete. It can either mean that the test suite does not cover all the edge cases, or that the piece of code containing the newly introduced bug is in fact dead and irrelevant.
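To make that concrete, here is a minimal sketch with a made-up function: the first test passes, but a mutation that changes `>=` to `>` would survive it, because the boundary is never exercised.

```rust
fn is_adult(age: u32) -> bool {
    age >= 18
}

#[cfg(test)]
mod tests {
    use super::*;

    // This passes, but the mutant `age > 18` would pass it too,
    // so the test suite is incomplete.
    #[test]
    fn far_from_the_boundary() {
        assert!(is_adult(30));
        assert!(!is_adult(10));
    }

    // Exercising the boundary kills that mutant.
    #[test]
    fn at_the_boundary() {
        assert!(is_adult(18));
        assert!(!is_adult(17));
    }
}
```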

If we follow this train of thought further, we can say that internal changes to the software which do not introduce bugs should not trigger test suite failures. In other words, if I just refactor code, but do not change how it fundamentally works, I don’t expect the test suite to fail.

Thus I postulate that a good test is one you write once and never change.

# What is a “unit”?

But hey, when you change your code, you also do need to change its tests, right? Right?

Well this naturally brings us to another question: what is the unit of code, the granularity or the boundary you want to test? Should it be a function, a class, maybe a module?

I would adapt the above statement a little bit and say that your unit of test should be something that never changes. This should ideally be the fundamental truths of your software.

Another important thing here is that it should be as self-contained as possible.

Good examples are data structures, some internals that have a clear boundary and that you do not need to touch often, or the API surface of a library.
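As a rough sketch of what I mean by a self-contained unit, consider a small, made-up bounded stack: the test only talks to the public API and its invariants, so the internals can be refactored freely without ever touching the test.

```rust
pub struct BoundedStack<T> {
    items: Vec<T>,
    capacity: usize,
}

impl<T> BoundedStack<T> {
    pub fn new(capacity: usize) -> Self {
        Self { items: Vec::new(), capacity }
    }

    /// Pushes a value, returning `false` if the stack is at capacity.
    pub fn push(&mut self, value: T) -> bool {
        if self.items.len() >= self.capacity {
            return false;
        }
        self.items.push(value);
        true
    }

    pub fn pop(&mut self) -> Option<T> {
        self.items.pop()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // Only fundamental truths are asserted: capacity is respected,
    // and values come back out in LIFO order.
    #[test]
    fn respects_capacity_and_order() {
        let mut stack = BoundedStack::new(2);
        assert!(stack.push(1));
        assert!(stack.push(2));
        assert!(!stack.push(3));
        assert_eq!(stack.pop(), Some(2));
        assert_eq!(stack.pop(), Some(1));
        assert_eq!(stack.pop(), None);
    }
}
```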

As a bad example, I would bring up my distaste for too narrow “unit” tests, or excessive mocking. Excessive mocking might be an indication that whatever you are testing is not self-contained enough, or that you are asserting insignificant side effects.

I have seen mocks and assertions related to metrics or log messages being emitted. Those are truly insignificant side effects.


The question of what a good test boundary is becomes a lot harder to answer for a big cloud service monolith like codecov or Sentry.

Here I would design larger tests that fully represent a “user story”. In the case of codecov, this might be: I upload a coverage report, I then expect that report to show up in codecov. For Sentry, it would be: I upload debug files, then I upload a crash, and expect to see a fully symbolicated stack trace.


Yes, larger tests can be slower, and they might be harder to write, but I claim that they have a much higher signal to noise ratio than tests with too narrow a scope.

# The place for snapshot tests

I wrote about snapshot testing over two years ago already, where I advocated against it in general.

The problem with snapshot testing is that it has a notoriously low signal to noise ratio. Snapshots change all the time. At some point both the developer and the reviewer stop paying attention and just rubberstamp changes to them, at which point they completely lose their purpose.

There is a place for snapshot tests, however, if you follow the above advice. If you encode the fundamental assumptions about your software that will never change, then snapshot tests are a great tool. I rely on those primarily and extensively in rollup-plugin-dts.

The only reason I ever have to touch those snapshots is when I update the underlying rollup dependency, which leads to changes in some output file names, or reordering of items. Otherwise those snapshots are stable and never change.
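As a rough illustration of that style in Rust, here is a minimal sketch using the insta crate (the render function and its output format are made up): the snapshot encodes the stable output, and only has to be touched when that output fundamentally changes.

```rust
// insta as a dev-dependency is assumed here.

/// A made-up function standing in for “render the bundled output”.
fn render_output(module: &str) -> String {
    format!("declare module \"{module}\" {{\n  export function run(): void;\n}}\n")
}

#[test]
fn output_is_stable() {
    // The first run records the snapshot; subsequent runs only fail
    // if the fundamental shape of the output changes.
    insta::assert_snapshot!(render_output("example"));
}
```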

# UI testing

One thing I find notoriously hard to test is UI, though, and so far I haven’t found a good solution for it either. Quite a while ago, I was a frontend engineer, or a developer tools engineer supporting frontend developers.

I have had experience with screenshot testing, both as the person implementing the infrastructure for it at Eversports, and just as an observer at Sentry. My sad conclusion there is that screenshot testing just isn’t worth it.

On both occasions, the tests were simply removed after a while, after they had caused significant pain for engineers.

I have also spent a ton of time working with Cypress at Eversports, even figuring out a way to collect code coverage for the frontend code, half a decade before joining codecov. The conclusion there was similar: those types of tests are finicky and often flaky, so they are high effort with a low signal to noise ratio.

Even before that I was working at pagestrip, which had a high emphasis on touch gestures and animations. To this day I haven’t found a good solution to properly test those.


Usually you want to avoid animations in tests (which I believe by now you can even signal as a preference via browser settings?), to speed things up and to reduce flakiness. But sometimes it’s all about the animations.

Similarly, I mentioned how bad screenshot tests are, but sometimes they are the only way to make sure that things look correct. Otherwise, if you mess up some CSS, your content can be all over the place, your tooltips might end up in the top left corner, and a messed up z-index means some element is obscuring parts of the page.

You kind of need screenshot tests for that. But screenshot tests are also kind of really really bad.

# What about Rust?

Even though I have been doing waaay too much Python recently, I still mainly consider myself a Rust developer. So what does Rust do in this regard?

I haven’t mentioned this fact thus far, but Rust does check an important box here:

A type checker solves half of your testing needs. Combine that with Rust’s focus on “make invalid states unrepresentable”, and you have fewer edge cases and fewer things to manually test for.
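As a small, made-up example of what that looks like in practice: where a struct with a pile of optional fields leaves invalid combinations to be tested for, an enum simply makes them impossible to construct.

```rust
#![allow(dead_code)]

// Several fields only make sense in certain combinations, and tests
// (or runtime checks) have to deal with the nonsensical ones too.
struct ConnectionFlags {
    connected: bool,
    session_id: Option<String>, // only meaningful when connected
    retry_count: Option<u32>,   // only meaningful when disconnected
}

// The enum makes the invalid combinations unrepresentable,
// so there is nothing left to test for them.
enum Connection {
    Disconnected { retry_count: u32 },
    Connected { session_id: String },
}
```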

Rust can also guide you towards using better design patterns, implicitly by making bad design patterns cumbersome to deal with.

An example here is mocking. Specifically, in order to allow for mocks, you would have to introduce generics. And generics can be a pain to work with sometimes. They are infectious, so you would have to carry them around everywhere. Or you would have to resort to dynamic dispatch, which is bad for performance. Usage in async code is even more painful, as you have to blindly add + Send + Sync + 'static, often without knowing exactly why.
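A minimal sketch of what I mean (the trait and types are made up): as soon as the storage becomes mockable via a trait, every layer above it either carries a generic parameter around or switches to dynamic dispatch.

```rust
#![allow(dead_code)]

trait Storage {
    fn get(&self, key: &str) -> Option<String>;
}

struct Processor<S: Storage> {
    storage: S,
}

impl<S: Storage> Processor<S> {
    fn process(&self, key: &str) -> usize {
        self.storage.get(key).map(|value| value.len()).unwrap_or(0)
    }
}

// Every caller is now generic as well ...
fn run_pipeline<S: Storage>(processor: &Processor<S>) -> usize {
    processor.process("some-key")
}

// ... or falls back to dynamic dispatch instead.
fn run_pipeline_dyn(storage: &dyn Storage) -> usize {
    storage.get("some-key").map(|value| value.len()).unwrap_or(0)
}
```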

All in all, I would say that Rust by itself discourages mocking, and rather encourages “plain old data”.


Apart from that, I believe Rust is in a uniquely good position with regard to testing compared to other language ecosystems. It has a built-in way of doing testing, which a ton of languages do not. It also has doctests, which I love, and which will get a whole lot better in Rust 2024 by changing the way they are compiled and run.
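For readers unfamiliar with doctests, here is a minimal sketch (the crate name my_crate is made up): the example inside the doc comment is compiled and run by cargo test, so the documentation and the test never drift apart.

````rust
/// Adds two numbers.
///
/// ```
/// assert_eq!(my_crate::add(2, 2), 4);
/// ```
pub fn add(a: u64, b: u64) -> u64 {
    a + b
}
````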

Even though I do not really believe in strict nomenclature around unit tests, I believe Rust does a good job of allowing for different categories.

I would rather call them in-crate tests, which are gated behind the cfg(test) attribute, and as they are defined within the crate itself, they can have access to private internals. And then you have external tests, which are limited to a crate’s public API. Plus the doctests I mentioned earlier.
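A minimal sketch of those two categories (with a made-up crate layout): the in-crate test can reach a private helper, while the external test under tests/ only sees the public API.

```rust
// src/lib.rs

// A private helper, invisible outside of the crate.
fn normalize(input: &str) -> String {
    input.trim().to_lowercase()
}

pub fn is_yes(input: &str) -> bool {
    normalize(input) == "yes"
}

// In-crate tests: only compiled with cfg(test), and allowed to
// exercise private internals like `normalize` directly.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn normalizes_whitespace_and_case() {
        assert_eq!(normalize("  YES  "), "yes");
    }
}
```

```rust
// tests/api.rs — an external test, limited to the crate’s public API.
#[test]
fn public_api_only() {
    assert!(my_crate::is_yes(" Yes "));
}
```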

Outside of the main language, you also have great utilities for snapshot testing, or for testing CLI functionality with trycmd.
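As an aside, a minimal sketch of how trycmd is typically wired up (the paths are made up): a single test function points at a glob of case files, and each .trycmd file describes a console session with the expected output.

```rust
// tests/cli.rs — trycmd as a dev-dependency is assumed.
#[test]
fn cli_cases() {
    // Each matching file contains a command invocation and its
    // expected stdout/stderr, which trycmd replays and compares.
    trycmd::TestCases::new().case("tests/cmd/*.trycmd");
}
```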

Again, as in a lot of other areas, Rust does a lot of things right. But it is certainly not the end of the story.

# Summary

To summarize, I believe these are some fundamental properties of good tests:

- Tests should maximize the signal to noise ratio, meaning that a failing test should indicate a real bug in your code.

- Tests should rarely change, so you train your reviewers to actually pay attention to changes, and not just rubberstamp them.

- Tests should maximize the return on investment, meaning you choose the right boundaries for your tests so you can cover a lot of ground with reasonable effort.

- Be pragmatic about it, which means that you strive for good, high quality tests, but don’t overdo it. It is okay to not (automatically) test all the code.