A deep dive into Portable PDB Sequence Points

2 September 2022 — 6 min

Following up my last post about SourceMaps, this one here is about Portable PDB Sequence Points.

Only took me about a month to procrastinate ;-)

# Sequence Points, abstractly

Similar to SourceMaps and other debug formats, the sequence points allow mapping from IL offsets to source information.

The Portable PDB Format is specified in a markdown document here and is complementary to the main ECMA-335 specification that is available in PDF format.

In particular, Portable PDB defines a new #Pdb stream, a bunch of new tables contained in the #~ stream, as well as new Blob formats that are within the #Blob heap.

Section II.23.2 of the main ECMA-335 spec describes a very specific way to save compressed integers that does not look very familiar, and comes with the tradeoff of only allowing at most 29 usable bits.

0b0xxx_xxxx: 7 usable bits encoded as 1 byte.
0b10xx_xxxx 0bxxxx_xxxx: 14 usable bits encoded as 2 bytes.
0b110x_xxxx 0bxxxx_xxxx 0bxxxx_xxxx 0bxxxx_xxxx: 29 usable bits encoded as 4 bytes.

The encoding is using big endian byte order, and the signed encoding is using rotation to move the sign bit into the last position.

One of the additional tables defined in the Portable PDB spec is the MethodDebugInformation which references a blob in #Blob heap containing sequence points. The MethodDebugInformation and the sequence points blob can also reference source files in the Document table.

These sequence points have the following information:

the start IL offset,
the document,
the start line / column,
the end line / column.

There is a bunch of things to note here:

Only the start IL offset is explicitly given, so similar to SourceMaps, each sequence point implicitly extends to the next one.
There are also "hidden" sequence points, probably to denote gaps in the mappings.
One specialty here is that the sequence points do not give a position in the source code, but rather a span.

# State Machine

Similar to the other mapping formats, the sequence points blob also acts as a state machine.

You have some mutable state, and have instructions and deltas that modify that state.

In this case, we start out with a document, and the blob can have an instruction that changes that document. The IL offset, line and column are also given as a delta to the previous record. And the source span is also delta-encoded.

The encoding is further complicated by the fact that either signed or unsigned encoding is used based on some condition. For example, the column delta is unsigned in case the source span does not span multiple lines. It is signed otherwise. This totally makes sense, as a source span should never go backwards. But it does add complexity to the decoder / encoder.

# Decoding a mapping

As an exercise, lets try to decode the following blob, and walk through the bytes one by one.

Our initial state machine starts out at all 0 values.

blob: 00 00 18 2e 09 06 00 12 04 08 06 00 01 02 79

0x00: add 0 to the IL offset
0x00: set source span line delta to 0
0x18: set source span column delta to 24
0x2e: add 46 to the start line, unsigned for the first entry
0x09: add 9 to the start column, unsigned for the first entry
- Sequence Point: { il_offset: 0, source_span: [46:9 - 46:33] }
0x06: add 6 to the IL offset
0x00: set source span line delta to 0
0x12: set source span column delta to 18
0x04: add 2 to the start line, signed
0x08: add 4 to the start column, signed
- Sequence Point: { il_offset: 6, source_span: [48:13 - 48:31] }
0x06: add 6 to the IL offset
0x00: set source span line delta to 0
0x01: set source span column delta to 1
0x02: add 1 to the start line, signed
0x79 (0b0111_1001, 0b1111_1100 rotated): subtract 4 from the start column
- Sequence Point: { il_offset: 12, source_span: [49:9 - 49:10] }

Mind you, this was a very simple (but real-life) example. We did not have any hidden sequence points, document changes or source spans that span multiple lines. But it did highlight how parsing the sequence points blob work, and also that we can get along with 5 bytes per sequence point for simple cases. Not bad.

# How to use these mappings

So how do we make use of these mappings?

Assuming we have a "normal" .NET runtime, we can get the IL offset trivially via the StackFrame.GetILOffset method. However, what might not be entirely obvious from our look at the format so far is that the IL offset is per method.

Getting the method index is not particularly obvious or well documented. Starting from the Method of a StackFrame, we can access the MetadataToken.

Section II.22 of the ECMA-335 spec says how to interpret this:

Uncoded metadata tokens are 4-byte unsigned integers, which contain the metadata table index in the most significant byte and a 1-based record index in the three least-significant bytes.

The table index for MethodDefs is 0x06 which we can assert, and the rest is the method index that also corresponds to the index inside our MethodDebugInformation table.

And there you have it. With these two pieces of information, we can resolve a StackFrame to its source location, or even source span.

# The elephant in the room

What is missing now is actually finding the Portable PDB file.

The PDB file has a self-describing UUID inside its #Pdb stream. And the corresponding executable file has a special CodeView record that is slightly different from normal CodeView records though. The difference is documented in this specification though. I have previously written about some pitfalls related to CodeView records btw.

Either way, getting the CodeView record and thus the UUID at runtime is not trivial. It requires reading that record directly from the PE file via the PEReader class. Creating a file stream to access that file from disk might not always be possible. Neither is getting a hold of the memory region where the PE file might already be mapped at.

This is still an unsolved problem right now unfortunately. Though even ahead-of-time compiled mobile apps ship the PE files in their app bundles. Most likely for the embedded runtime metadata. Which makes me hopeful that we can access those at runtime somehow and close the loop.

# Summary

We took a deep dive into the Portable PDB format, and we learned a bunch of things about it:

Portable PDBs extend the ECMA-335 format. Both are reasonably well documented.
The PDB has a list of Documents and MethodDebugInformation with sequence points.
The sequence points blob forms a state machine that yields sequence points.
These sequence points have an IL offset, a document and source span.
You can get the IL offset and the method index at runtime fairly easily.

The Portable PDB does not include the data needed to pretty print function signatures. That is embedded in the ECMA-335 metadata of the executable file.

Speaking of which, the executable also has a reference to the Portable PDB via its UUID. But that is not readily available at runtime.

Two down, one to go. I previously explained SourceMaps in detail, and Portable PDB Sequence Points today. I plan to take a look at more formats in future posts, so look out for:

DWARF line programs
PDB line programs