A locking war story— 6 min
An alternative clickbait title for this could be: “
Read + Seek considered dangerous”.
This is a very interesting story, and one of the nice side effects of working on open source software is that I can share all of the details of it publicly, along with a link to the PR that implemented the fix.
As the alternative clickbait title suggests, the core of the problem is that both
Seek, and the combination
of the two need a
&mut reference to the reader to do any operations. So read-only access still requires an exclusive
reference. (Aside: This might be a good example why people have advocated to call
&mut exclusive access.)
In my example, I was dealing with a
zip::ZipArchive, which wraps a
Read + Seek, and needs
&mut self access to read
a file from the archive. So sharing this archive across multiple tasks that want to read files from it leads to lock
contention as only a single task can read files from the archive at a time.
zip file that
we call a
SourceBundle. This archive also contains a manifest, which has a bit of metadata for each file. Things like
the reference to the corresponding SourceMap for files that do not have an embedded
sourceMappingURL reference. And
also most importantly, this metadata includes a
url for that file, because SourceMap processing sadly still relies on
very brittle URLs. I touched on those problems in my previous post around file identity,
so I won’t go into more details.
# Being too Smart for our own Good
The primary driver of moving more parts of the processing to Rust was to be able to better reuse repeated computations. Our SourceMap processing infers function / scope names by parsing the minified source, and it builds a fast lookup index that is meant to be reused. Although the python code never did that. The stateful Rust service however has a variety of in-memory and on-disk caches to avoid expensive computations for each event that needs to be processed.
One of the more expensive computations that I wanted to avoid was opening up the zip archive and parsing the manifest
contained within. We then ended up with a parsed manifest / index, and an open
zip::ZipArchive, more precisely a
zip::ZipArchive<std::io::Cursor<&'data [u8]>>. So we already have a memory-mapped
&[u8] that gives us trivial
random access. But we need to wrap it in a
Cursor to make it into a
Read + Seek. As the
access, we also had to wrap it in a
Mutex. And this
Mutex was exactly the thing that was contended in this case.
Trying to avoid repeatedly opening and parsing the manifest by keeping it in-memory and sharing it across computations
combined with that
Mutex meant that all the events that needed access to a specific zip file were all contending on
that mutex. Feeding more events to a single server even made things worse, and caused trouble for the whole pipeline.
The problem with
Read + Seek is that it indeed needs to maintain some internal mutable state, namely the
cursor position. If it were not synchronized using a
&mut and a
Mutex, it would mean that concurrent readers could
potentially read garbage, or worse. So thank you Rust for the strict guarantees that avoided that :-)
The solution in the end was to give each reader its own (still
Mutex-locked) copy of the
ZipArchive. According to
its docs, it is a cheap to clone if its generic reader is, which is the case for
Cursor. Rolling out this fix indeed
fixed the contention problem for us, and our production systems are now much happier. Although they are still doing way
too much unzipping, but later on that.
# Can we do better?
The mutable state fundamentally comes from usage of
Read which implicitly updates a cursor position, and
does so explicitly. And this is a reasonable choice for
ZipArchive, as I believe if is most frequently used in
combination with a
std::io::BufReader<std::fs::File>. However, I believe there are a few crates out there that
abstract over the reader as well. For example, both
with shared references, and require an explicit
offset for each of the read methods, instead of maintaining the offset
In our case we have a memory-mapped
&[u8], and reading from that is a trivial memory access. I cannot overstate how
much of a productivity and sanity boost
mmap is. Sure, one might argue that
Read gives more explicit control, and
it is very obvious and explicit when a syscall and context switch to the kernel happens, whereas with
mmap that is
done implicitly via page faults. Maybe in some very extreme situations, deep control over this might be beneficial, but
in the general case, again, I cannot overstate how awesome
# A rant on zip
While the lock contention issue, and the read-only, but not really nature of
ZipArchive was a pain, but one that
was easily fixable, there is another issue looming here. Why are we using zip archives in the first place? The fact that
lock contention became a problem highlights that we are using these archives a lot. And while we have various caches all
over the place, one thing that is not cached right now is access to the files within that zip archive.
So we are really using the same files from within the same archives all over again. And we are decompressing them over and over again. I haven’t measured this in production yet, but running this through a local stress test highlights the fact that our processing is now mainly dominated by decompression.
Zip archives are great and they serve a specific purpose, but their main purpose is long-term archival as the name suggests, not frequent random access. There might be a possibility to still use zip archives, but using a compression algorithm that is faster for decompression, but that is a story for a different time. Along with the discussion to maybe use something else entirely.
All in all, I am fairly happy with the fact that decompression seems to now dominate the performance, as it means that the rest of the architecture at least is doing a really great job at being high-performance :-)