Format Ossification

Before going into the details of my recent discovery, let's define the term Ossification, as a lot of people have likely never heard that word before.

Imagine you have an extensible format or protocol. As an example, take a list of elements of different types. The list is extensible: it can have an arbitrary number of elements, and over time new element types can be added.

This is great. But what happens if you never use this extensibility? Let's say that, for years and years, your list has always had exactly one element, and that element has always been of a very specific type.

Well, users of that format or protocol will start relying on that very fact, and will assert this assumption in code, or worse, in hardware.

So your format is extensible in theory, but you can never extend it in practice because tools have come to rely on a very specific size and order.
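To make this concrete, here is a minimal Python sketch. The record layout (a one-byte type, a one-byte length, then the payload) and all function names are made up for illustration and do not correspond to any real format.

```python
import struct

def encode(records):
    """Serialize (type, payload) pairs into the toy format."""
    return b"".join(struct.pack("<BB", t, len(p)) + p for t, p in records)

def read_ossified(data, wanted_type):
    """Assumes the record we want is always the first (and only) one."""
    t, length = struct.unpack_from("<BB", data, 0)
    return data[2:2 + length] if t == wanted_type else None

def read_robust(data, wanted_type):
    """Walks every record and skips the types it does not care about."""
    offset = 0
    while offset + 2 <= len(data):
        t, length = struct.unpack_from("<BB", data, offset)
        if t == wanted_type:
            return data[offset + 2:offset + 2 + length]
        offset += 2 + length
    return None
```

As long as writers only ever emit `encode([(1, b"id")])`, both readers behave the same. The day a writer emits `encode([(7, b"extra"), (1, b"id")])`, `read_ossified` stops finding the record while `read_robust` keeps working.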

That is called Ossification and is sadly a reality, especially in network protocols.

And as I found out recently, it is also a thing for the COFF/PE file format, the format of Windows .exe/.dll files.

My journey starts with a Sentry customer issue. We got a report about a processing error that complained about an invalid "image type", whatever that means. (An image here is a loaded library or executable.)

The image in question was indeed missing its type field, but it did have other fields that are normal for images in the Sentry protocol. The event also made it clear that it was coming from Windows.

With that information, I looked at the code in the sentry-native SDK that collects these images, and indeed found some early returns that would leave an image entry without a type. I fixed the issue by reordering the code so we still get a type even if we can’t find a CodeView record for the image.

A while later, while investigating how to link from a C# stack trace to the corresponding portable PDB, I stumbled across the PEReader.ReadDebugDirectory method.

This method returns an array of DebugDirectoryEntry values, whereas the code from sentry-native I had been looking at just two weeks earlier was reading a single entry. Interesting.

Fast forward to today, where I am again investigating a customer issue related to a .dll that does not seem to have a valid debug_id (which comes from the CodeView record mentioned above).

It took some time until the things I had seen clicked in my brain. What if our tools make wrong assumptions about the shape of a PE file and its Debug Directory entries? What if for years all PE files had a single Debug Directory entry that happened to be the CodeView record? What if suddenly some new compiler version is generating PE files that have more than one Debug Directory entry, and the CodeView record is not the first one anymore?

Well, classic case of Ossification. Things are extensible in theory, but since that extensibility was never exercised for years, all the tools developed around this format came to expect things that are not true anymore.

# How did this happen?

Well, the simple answer is that the available documentation around all this is quite lacking, to put it mildly.

The main documentation for IMAGE_DATA_DIRECTORY mentions a Size that is described as:

The size of the table, in bytes.

Okay, yeah, great. There is no documentation or example of what to do with this. It is not at all obvious that this is the size in bytes of an array, and that the resulting array has total_size / sizeof(IMAGE_DEBUG_DIRECTORY) elements.
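As a sketch of that arithmetic in Python: the field layout below follows the IMAGE_DEBUG_DIRECTORY declaration in winnt.h, which packs to 28 bytes. The function name is made up for this example, and silently ignoring trailing bytes is my own assumption about how lenient a reader should be.

```python
import struct

# IMAGE_DEBUG_DIRECTORY fields, little-endian, in winnt.h order:
# Characteristics, TimeDateStamp (u32 each),
# MajorVersion, MinorVersion (u16 each),
# Type, SizeOfData, AddressOfRawData, PointerToRawData (u32 each).
IMAGE_DEBUG_DIRECTORY = struct.Struct("<IIHHIIII")

def debug_entry_count(directory_size):
    """Number of whole entries in a debug directory of `directory_size` bytes."""
    # Trailing bytes that do not form a full entry are ignored (assumption).
    return directory_size // IMAGE_DEBUG_DIRECTORY.size
```

A Size of 28 therefore means one entry, and a Size of 84 means three.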

The documentation for IMAGE_DEBUG_DIRECTORY is also quite outdated. The docs online describe the Type field up to number 9. The winnt.h header has defines up to number 20, but without any descriptions.

If you happen to stumble upon the specification of the .NET/C# extension to PE/COFF, that document does indeed say this is an array:

This directory consists of an array of debug directory entries whose location and size are indicated in the image optional header.

Hooray, big success! The doc also describes some of the Types missing from the winnt.h header and the other documentation.

It also has a description for the CodeView record itself, which is lacking from the other Windows docs and from the winnt.h header.

In particular, this RSDS (PDB 7.0) CodeView format is being read by a huge number of tools, but I can’t find any official documentation for it anywhere. The .NET extension specification linked above is the closest I could find.
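Going by the layout given in that .NET document (a 4-byte `RSDS` signature, a 16-byte GUID, a 4-byte age, then a NUL-terminated UTF-8 path to the PDB), a reader can be sketched in Python roughly like this. The function name is hypothetical, and the sketch assumes the record bytes have already been located in the file.

```python
import struct
import uuid

def parse_rsds(record):
    """Parse an RSDS (PDB 7.0) CodeView record; return None if it is not one."""
    if len(record) < 24 or record[:4] != b"RSDS":
        return None
    # Windows GUIDs store their first three fields little-endian.
    guid = uuid.UUID(bytes_le=bytes(record[4:20]))
    (age,) = struct.unpack_from("<I", record, 20)
    # The PDB path runs up to the first NUL byte.
    path = record[24:].split(b"\0", 1)[0].decode("utf-8", errors="replace")
    return guid, age, path
```

Returning `None` for anything that does not start with `RSDS` keeps the caller from treating some other record format as a PDB 7.0 record.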

The MINIDUMP_MODULE documentation also mentions a CodeView record, but it is also missing a description of how to interpret it.

So to summarize, the PE format has very incomplete or outright missing documentation. And the tools dealing with it are probably cargo-culting wrong assumptions from one implementation to the next.

# What now?

Well, we figured out that a PE file can have multiple Debug Directory entries, and any one of them can be the CodeView record we are looking for.
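A minimal Python sketch of what "not assuming the first entry" can look like. It assumes the caller has already resolved the debug data directory to a file offset and size (the RVA-to-offset mapping through the section table is left out), and the function name is hypothetical. The entry layout and the value 2 for IMAGE_DEBUG_TYPE_CODEVIEW are taken from winnt.h.

```python
import struct

# IMAGE_DEBUG_DIRECTORY: Characteristics, TimeDateStamp, MajorVersion,
# MinorVersion, Type, SizeOfData, AddressOfRawData, PointerToRawData.
ENTRY = struct.Struct("<IIHHIIII")
IMAGE_DEBUG_TYPE_CODEVIEW = 2

def find_codeview(file_bytes, dir_offset, dir_size):
    """Return the raw bytes of the first CodeView entry, or None."""
    for i in range(dir_size // ENTRY.size):
        fields = ENTRY.unpack_from(file_bytes, dir_offset + i * ENTRY.size)
        entry_type, size_of_data, pointer_to_raw_data = fields[4], fields[5], fields[7]
        if entry_type != IMAGE_DEBUG_TYPE_CODEVIEW:
            continue  # some other debug type: skip it instead of giving up
        return file_bytes[pointer_to_raw_data:pointer_to_raw_data + size_of_data]
    return None
```

The important part is the `continue`: an entry of an unexpected type is skipped, and only running out of entries means there is no CodeView record.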

Time to see which tool got this right, and fix the ones that got it wrong.

Here are PRs for sentry-native, goblin and object.

To my surprise, crashpad actually got this right. To my surprise because I was also looking at a customer minidump created by crashpad that was missing CodeView records for some of the minidump modules. (Yes, the loaded executable code is called image in PE and Sentry terminology, whereas minidumps call them modules. Confused yet?)

Looking at the customer .dll again, it became clear that it did have a Debug Directory entry, but it wasn’t a CodeView one. Maybe if it had one, it would indeed be the first? Even so, the point here is to not make any assumptions about that.

So in the end I was chasing a ghost all along. But at least I learned a ton in the process, and de-ossified a bunch of tools along the way.

The specific customer issue boils down to "fix your build system", and that is the end of the story.