Swatinem Blog Resume

Non-abbreviated Abbreviations

— 6 min

I have recently investigated a very interesting performance problem in Sentrys symbolication infrastructure.

We got reports of an increasing number of out-of-memory situations of our infrastructure. This started rather randomly, and was not correlated to any deploys. I had the hunch that it might be related to some new form of data that customers were throwing at us.

And indeed, after some time, we were able to track this down a customer project that was fetching gigantic debug files from a custom symbol source. These files were on the order of 5.5 G in size, which made them the largest valid debug files I have seen thus far.

The next step was reproducing the issue locally. Sure enough, processing this debug file took an unreasonably long amount of time, and a whooping 18 G in peak memory usage. Attaching a profiler to that long running process revealed that it was indeed spending a large portion of its running time in dropping gimli::Abbreviations.

Looking through the gimli code, it became clear that none of the involved types had custom Drop implementations, but some nested Vecs. This could potentially explain the large memory usage, and with that comes the runtime for allocations and so on.

But where do all the Abbreviations come from?

# What are Abbreviations

Put simply, abbreviations in DWARF describe the schema, or the blueprint of debug information entries (DIE). This schema describes the type of DIEs and its attributes and children. The DIE itself then just has a code referring to its abbreviation / schema, and then just the raw contents of its attributes and children.

These abbreviations are meant to be reused a lot. There can be more than one list of abbreviations. A compilation unit (CU) can refer to its abbreviations list via an offset, and the DIE code is just the index in this list.

Depending on the linker, you can end up with a single global abbreviations list, or with smaller lists, one for each CU. Unfortunately most linkers are quite dumb, they mostly just concatenate raw bytes and patch up some offset here and there. Optimizing and deduplicating DWARF data is complex and slow after all.

Turns out, the notoriously slow MacOS dsymutil actually does some DWARF optimization. Specially to our case here, it does merge and deduplicate all the abbreviations.

# The gimli Problem

Getting back to my investigation, I found out that the crate we use for DWARF handling, gimli, was rather optimized for the case where each CU has its own abbreviations. each CU would parse (and allocate) the abbreviations it was referring to. Abbreviations were not shared across CUs as is the case with smarter linkers. So far this was never a problem because we were dealing with relatively "small" files. But the file I was looking at was gigantic, and had a huge number of CUs. doing all this duplicate work and memory allocations for each CU got very expensive very quickly.

I reported a problem in detail and also suggested a PR to fix it. There is also another PR that offers the same benefits with a bit simpler API.

# Doing some more tests

Okay, lets look at some more examples and run some tests.

A well known and public project with very large debug files is Electron. You can download these debug files from https://github.com/electron/electron/releases/tag/v20.1.4 if you are interested to reproduce my experiments.

I have downloaded the MacOS x64 and Linux x64 debug files. The Linux debug file in the electron archive additionally has zlib compressed debug sections. We can unpack those ahead of time using llvm-objcopy --decompress-debug-sections electron.debug.

Using llvm-objdump --section-headers we can look at both the MacOS and Linux debug files.

The relevant line for the MacOS symbol is: __debug_abbrev 00004e61, and for Linux it is .debug_abbrev 02df0539. Or written as decimal, the MacOS abbreviations are feathery light with only about 20K, whereas the linux file has a whooping 50M of abbreviations.

We can also count these with a bit of shell magic: llvm-dwarfdump --debug-abbrev electron.dsym | grep DW_TAG | wc -l. There is 1_112 for MacOS, and 3_475_269 for Linux. That is a lot.

And how expensive is having redundant abbreviations in the raw data, vs redundant parsing, vs deduplicated parsing?

I wrote a very simple test that just iterates over all of the CUs in a file:

    let mut units = dwarf.units();
    while let Some(header) = units.next()? {
        let _unit = dwarf.unit(header)?;

Running this code on the published version of gimli, vs the PR linked above gives me the following change for the Linux file that has one abbreviation per CU:

Benchmark #1: ./gimli-0.26.2 ./tests/abbrevs/electron.debug
  Time (mean ± σ):     852.6 ms ±  23.9 ms    [User: 800.5 ms, System: 52.6 ms]
  Range (min … max):   820.0 ms … 879.8 ms    20 runs

Benchmark #2: ./gimli-patched-626 ./tests/abbrevs/electron.debug
  Time (mean ± σ):     860.2 ms ±  28.1 ms    [User: 801.0 ms, System: 52.8 ms]
  Range (min … max):   819.4 ms … 916.0 ms    20 runs

A tiny regression from more indirection. Lets try the same for the deduplicated MacOS DWARF:

Benchmark #3: ./gimli-0.26.2 ./tests/abbrevs/electron.dsym
  Time (mean ± σ):      4.780 s ±  0.052 s    [User: 4.705 s, System: 0.066 s]
  Range (min … max):    4.719 s …  4.874 s    20 runs

Benchmark #4: ./gimli-patched-626 ./tests/abbrevs/electron.dsym
  Time (mean ± σ):     225.1 ms ±   1.8 ms    [User: 185.5 ms, System: 36.3 ms]
  Range (min … max):   219.6 ms … 230.7 ms    20 runs

That is a huge difference right there.

And finally with the original customer debug file I investigated:

Benchmark #5: ./gimli-0.26.2 ./tests/abbrevs/giant
  Time (mean ± σ):     105.556 s ±  1.733 s    [User: 104.921 s, System: 0.514 s]
  Range (min … max):   102.769 s … 108.016 s    10 runs

Benchmark #6: ./gimli-patched-626 ./tests/abbrevs/giant
  Time (mean ± σ):     760.7 ms ±  28.7 ms    [User: 647.5 ms, System: 104.7 ms]
  Range (min … max):   725.7 ms … 814.6 ms    10 runs

That is indeed a night and day difference.

In both cases, this makes a two orders of magnitude difference. However the example only tests an extremely limited part of our DWARF processing.

But it does highlight how important it is to cache redundant computations, as well as to deduplicate the raw data in the first place.