[llvm-dev] [RFC] Optimization remarks: LLVM bitstream format and future plans

Tue Jun 18 13:43:02 PDT 2019

Hello,

We have been looking into making optimization remarks more scalable.

We were looking for a format that satisfies the following requirements:
  1. allows streaming to a file: we want to avoid keeping all the remarks in memory
  2. allows string deduplication: most of the strings are repeated [1]
  3. is fast to parse: building clang with remarks results in 24,205,892 remarks
  4. is compact on disk: building clang with YAML remarks results in 17.6GB of remarks
  5. supports some kind of key-value pairing: we need to have arbitrary remark “arguments” [2]

We took a look at a few formats:
  * YAML: requirements 3 and 4 (parsing speed and size on disk) are very far from reasonable with this format.
  * MessagePack [3]: LLVM already has support for it, which is an advantage for this format. It allowed us to make parsing 5.5x faster and remark files more than 2x smaller.
  * clangd’s RIFF-based format [4]: requirements 1 and 5 (streaming and key-value pairing) are not satisfied here.
  * .dia: parsing this format (using libclang) is not fast enough for us.
  * custom format: we managed to make remarks 11x smaller, and parsing fast enough. The main concern with a custom format is the maintenance and versioning of the format.
  * LLVM bitstream:
    1. by emitting a block per remark, we can stream to a file
    2. by keeping a string table separately, in the metadata, we can deduplicate strings
    3. llvm-bcanalyzer runs in 20s over all the remark files for clang
    4. total size of remarks for clang is 1.3GB -> 13.4x smaller
    5. we can have an arbitrary number of records and describe them using abbreviations to provide a key-value-like pairing
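
To make points 1 and 2 above concrete, here is a minimal Python sketch of the idea: strings are interned in a table, and each remark is written out immediately as a small record of string indices. This is only an illustration, not the bitstream layout from the patch; `StringTable`, `emit_remark`, and the record layout (a 32-bit count followed by 32-bit indices) are all made up for the example.

```python
import io
import struct


class StringTable:
    """Hypothetical interning table: each distinct string is stored once."""

    def __init__(self):
        self._index = {}   # string -> position in self.strings
        self.strings = []  # distinct strings, in first-seen order

    def intern(self, s):
        idx = self._index.get(s)
        if idx is None:
            idx = len(self.strings)
            self._index[s] = idx
            self.strings.append(s)
        return idx


def emit_remark(out, table, remark):
    """Write one remark right away; nothing is kept except the string table."""
    ids = [table.intern(remark[k]) for k in ("Pass", "Name", "Function")]
    # Arguments are arbitrary key/value pairs; both sides are interned.
    for key, value in remark["Args"]:
        ids += [table.intern(key), table.intern(value)]
    # Made-up record layout: u32 count, then that many u32 string indices.
    out.write(struct.pack("<I", len(ids)))
    out.write(struct.pack(f"<{len(ids)}I", *ids))


remark = {
    "Pass": "inline",
    "Name": "NoDefinition",
    "Function": "printArgsNoRet",
    "Args": [
        ("Callee", "printf"),
        ("String", " will not be inlined into "),
        ("Caller", "printArgsNoRet"),
        ("String", " because its definition is unavailable"),
    ],
}

table = StringTable()
out = io.BytesIO()
emit_remark(out, table, remark)
emit_remark(out, table, remark)  # second copy adds no new strings
```

In this sketch, emitting the same remark a second time adds another 48-byte record to the stream but no new entries to the string table, which is where the deduplication win comes from.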

We decided to go ahead with LLVM bitstream since it satisfies all our requirements and is well known in the community.

The remark generation part of the format is available for review at: https://reviews.llvm.org/D63466.

Another goal is to make it easy to find remarks for a given object file or binary. The way we want to do this on Darwin is to follow the debug info model: add a section to the object file, make the linker ignore it, let dsymutil pack it up and put the final result in the .dSYM bundle.

For that, I plan on making a few more changes:
  * Emit the bitstream metadata in the __LLVM,__remarks/.remarks section
  * Add the parsing logic to lib/Remarks/RemarksParser and make it usable through the C API
  * Add a tool, llvm-remarkutil, to merge the remarks from the object files into a standalone remark file
  * Add support to dsymutil to merge and generate a standalone remark file in the .dSYM bundle
  * Add support to llvm-remarkutil to convert from YAML to bitstream, to extract metadata from sections, and other utilities
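
One thing the merging step would presumably have to deal with is that each object file carries its own string table, so string indices need to be remapped when remarks are combined. The Python sketch below shows that idea under that assumption; it is not how llvm-remarkutil or dsymutil are implemented, and `merge_string_tables` and `remap_remark` are hypothetical names.

```python
def merge_string_tables(tables):
    """Merge per-object string tables.

    Returns the merged list of distinct strings and, for each input table,
    a list mapping old indices to indices in the merged table.
    """
    merged = []
    index = {}  # string -> position in merged
    remaps = []
    for strings in tables:
        remap = []
        for s in strings:
            if s not in index:
                index[s] = len(merged)
                merged.append(s)
            remap.append(index[s])
        remaps.append(remap)
    return merged, remaps


def remap_remark(ids, remap):
    """Rewrite a remark's string indices against the merged table."""
    return [remap[i] for i in ids]


# Two hypothetical object files, each with its own string table.
a_strings = ["inline", "NoDefinition", "foo"]
b_strings = ["licm", "inline", "bar"]

merged, remaps = merge_string_tables([a_strings, b_strings])

# A remark from the second file referring to "inline", "licm", "bar".
remark_b = [1, 0, 2]
merged_b = remap_remark(remark_b, remaps[1])
```

Strings shared between the two inputs ("inline" here) end up stored once in the merged table, so the standalone file keeps the deduplication benefit across object files.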

Please let me know what you think!

Thanks,

-- 
Francis

[1] 2x size reduction with https://reviews.llvm.org/rG7fee2b89fd6e5101bc590e0741f4d7a82b7715e1
[2] Usually, remarks have arbitrary arguments, like the “Args” part of:
```
--- !Missed
Pass:     inline
Name:     NoDefinition
DebugLoc: { File: 'test-suite/SingleSource/UnitTests/2002-04-17-PrintfChar.c',
            Line: 7, Column: 3 }
Function: printArgsNoRet
Args:
  - Callee:   printf
  - String:   ' will not be inlined into '
  - Caller:   printArgsNoRet
    DebugLoc: { File: 'test-suite/SingleSource/UnitTests/2002-04-17-PrintfChar.c',
                Line: 6, Column: 0 }
  - String:   ' because its definition is unavailable'
...
```
[3] https://msgpack.org/index.html
[4] https://reviews.llvm.org/rG50f3631057f717448ba34b4175daaa81215fbd5e