[cfe-dev] [RFC] Compressing AST files by default?

Richard Smith via cfe-dev cfe-dev at lists.llvm.org
Fri Oct 21 13:49:37 PDT 2016


On Fri, Oct 21, 2016 at 12:48 PM, Gábor Horváth <xazax.hun at gmail.com> wrote:

> On 21 October 2016 at 21:26, Richard Smith via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> On Thu, Oct 20, 2016 at 2:23 AM, Ilya Palachev via cfe-dev <
>> cfe-dev at lists.llvm.org> wrote:
>>
>>> Hi,
>>>
>>> It seems that compressing AST files with simple "gzip --fast" makes them
>>> 30-40% smaller.
>>> So the questions are:
>>>  1. Is the current AST serialization format really uncompressed (only
>>> abbreviations in the bitstream format)?
>>>  2. Is it worthwhile to compress AST by default (with -emit-ast)?
>>>  3. Will this break things like PCH?
>>>  4. What's the current trade-off between PCH compile time and disk
>>> usage? If AST compression makes compilation a bit slower but reduces disk
>>> usage significantly, would that be acceptable to users?
>>>
>>> LLVM already has support for compression (the compress/uncompress
>>> functions in include/llvm/Support/Compression.h).
>>
>>
>> The current AST format is designed for lazy, partial loading from disk;
>> we make heavy use of file offsets to pull in only the small portions of AST
>> files that are actually used. In a compilation using hundreds or thousands
>> of AST files, it's essential that we don't load any more than we need to
>> (just the file headers) since we should need essentially nothing from
>> almost all loaded files.
>>
>> Any approach that requires the entire file to be decompressed seems like
>> a non-starter. I would expect you could get something like the 30-40%
>> improvements you're seeing under gzip by making better use of abbreviations
>> and using smarter representations generally. There is some easy low-hanging
>> fruit here.
>>
>
> I agree; I did see some low-hanging fruit in the serialized AST format. I
> ran one measurement to see which parts of the ASTs contribute the most to
> the AST dumps' size. For the details, see the JSON attached to this mail:
> http://clang-developers.42468.n3.nabble.com/Two-pass-analysis-framework-AST-merging-approach-tp4051301p4052577.html
>

Cool! Is the tool you used to produce this available somewhere? (Are you
combining the results of llvm-bcanalyzer or inspecting the bitcode files
directly yourself?)

Some obvious targets for adding abbrevs (>1GB and completely unabbreviated):

                "DECL_CXX_METHOD": {
                    "count": 54583363,
                    "bits": "4.49 GB",
                    "abv": 0.0
                },

                "DECL_CXX_CONSTRUCTOR": {
                    "count": 17594183,
                    "bits": "1.47 GB",
                    "abv": 0.0
                },

                "DECL_CXX_RECORD": {
                    "count": 24180665,
                    "bits": "1.1 GB",
                    "abv": 0.0
                },

                "DECL_CLASS_TEMPLATE_SPECIALIZATION": {
                    "count": 17971702,
                    "bits": "1.77 GB",
                    "abv": 0.0
                },

A couple of other things I've been planning to do to improve AST file size
(but have not got around to yet):

* We should allow each Decl kind to specify a list of abbreviations (from
most specific to most general) and use the first one that fits the data. We
should always use *some* abbreviation for every Decl, even if we only get
to abbreviate the base Decl fields and use an array of VBR6 for the rest.
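The "first abbreviation that fits" idea can be sketched roughly as follows. This is a simplified illustration in Python, not Clang's actual API: the field widths, record values, and abbreviation names are all made-up assumptions, standing in for the fixed-width operands an LLVM bitstream abbreviation would declare.

```python
# Hypothetical sketch of per-Decl-kind abbreviation selection.
# Each "abbreviation" declares fixed bit widths for its fields; a record
# can use it only if every value fits in the declared width.

def fits(abbrev, record):
    """True if every record value fits in the abbreviation's field width."""
    if len(abbrev["fields"]) != len(record):
        return False
    return all(value < (1 << width)
               for width, value in zip(abbrev["fields"], record))

def pick_abbrev(abbrevs, record):
    """Try abbreviations from most specific to most general; fall back to
    a fully generic encoding (e.g. an array of VBR6) that fits anything."""
    for abbrev in abbrevs:
        if fits(abbrev, record):
            return abbrev["name"]
    return "generic-vbr6-array"

# Most specific first: narrow fixed fields, then wider ones.
decl_cxx_method_abbrevs = [
    {"name": "common-case", "fields": [1, 4, 8]},    # small flags/operands
    {"name": "wide-case",   "fields": [8, 16, 32]},  # larger operands
]

print(pick_abbrev(decl_cxx_method_abbrevs, [1, 7, 100]))    # narrow abbrev fits
print(pick_abbrev(decl_cxx_method_abbrevs, [200, 7, 100]))  # needs the wider one
```

The point is that the common case pays only for the bits it needs, while rare records degrade gracefully to a more general abbreviation instead of going completely unabbreviated.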

* We store SourceLocations as absolute offsets right now, wasting a lot of
bits on redundant information. Instead, we should store SourceLocations as
a delta from the root location of the record we're reading (the location of
the Decl or the start of the Stmt tree).
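To illustrate the potential saving, here is a rough Python sketch of the VBR6 cost (each 6-bit chunk carries 5 payload bits plus a continuation bit, as in LLVM's bitstream VBR encoding). The specific offset values are invented for illustration, not measured from a real AST file.

```python
def vbr6_bits(value):
    """Bits needed to store `value` as VBR6: each 6-bit chunk carries
    5 data bits plus one continuation bit."""
    chunks = 1
    while value >= 32:  # more than 5 payload bits remaining
        value >>= 5
        chunks += 1
    return chunks * 6

# Hypothetical numbers: an absolute source offset deep into a large
# translation unit vs. a small delta from the enclosing Decl's location.
absolute_offset = 48_000_000   # ~26 significant bits
delta_from_decl = 17           # fits in a single chunk

print(vbr6_bits(absolute_offset))  # 36 bits
print(vbr6_bits(delta_from_decl))  # 6 bits
```

Since most SourceLocations in a record cluster near the record's root location, deltas stay small and the per-location cost drops to one or two VBR6 chunks instead of five or six.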