[cfe-dev] [RFC] Compressing AST files by default?

Mon Oct 24 03:10:32 PDT 2016

On 21 October 2016 at 22:49, Richard Smith <richard at metafoo.co.uk> wrote:

> On Fri, Oct 21, 2016 at 12:48 PM, Gábor Horváth <xazax.hun at gmail.com>
> wrote:
>
>> On 21 October 2016 at 21:26, Richard Smith via cfe-dev <
>> cfe-dev at lists.llvm.org> wrote:
>>
>>> On Thu, Oct 20, 2016 at 2:23 AM, Ilya Palachev via cfe-dev <
>>> cfe-dev at lists.llvm.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> It seems that compressing AST files with simple "gzip --fast" makes
>>>> them 30-40% smaller.
>>>> So the questions are:
>>>>  1. Is current AST serialization format really non-compressed (only
>>>> abbreviations in bit stream format)?
>>>>  2. Is it worthwhile to compress AST by default (with -emit-ast)?
>>>>  3. Will this break things like PCH?
>>>>  4. What's the current trade-off between PCH compile time and disk
>>>> usage? If AST compression makes compilation a bit slower, but reduces the
>>>> disk usage significantly, will this be appropriate for users or not?
>>>>
>>>> LLVM already has support for compression (the compress/uncompress
>>>> functions in include/llvm/Support/Compression.h).
>>>
>>>
>>> The current AST format is designed for lazy, partial loading from disk;
>>> we make heavy use of file offsets to pull in only the small portions of AST
>>> files that are actually used. In a compilation using hundreds or thousands
>>> of AST files, it's essential that we don't load any more than we need to
>>> (just the file headers) since we should need essentially nothing from
>>> almost all loaded files.
>>>
>>> Any approach that requires the entire file to be decompressed seems like
>>> a non-starter. I would expect you could get something like the 30-40%
>>> improvements you're seeing under gzip by making better use of abbreviations
>>> and using smarter representations generally. There is some easy low-hanging
>>> fruit here.
>>>
>>
>> I agree; I did see some low-hanging fruit in the serialized AST
>> format. I did one measurement to see which parts of the ASTs contribute
>> the most to the size of the AST dumps. For the details, see the JSON
>> attached to this mail:
>> http://clang-developers.42468.n3.nabble.com/Two-pass-analysis-framework-AST-merging-approach-tp4051301p4052577.html
>>
>
> Cool! Is the tool you used to produce this available somewhere? (Are you
> combining the results of llvm-bcanalyzer or inspecting the bitcode files
> directly yourself?)
>

I modified llvm-bcanalyzer to output its information in JSON format and used a
Python script to summarize the projects. (Before that, I ran -emit-ast on
every TU in the LLVM and Clang source trees.)
I have attached the Python script I used to aggregate the output of the
modified bcanalyzer.
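
For readers without the attachment, the aggregation step can be sketched
roughly as follows. This is not the attached summarizeAsts.py; it is a minimal
illustration that assumes a hypothetical per-TU JSON layout mapping each
record name to "count", "bits" (as a plain integer) and "abbrev_count". The
names summarize, largest and load_reports are made up for this sketch, and
"abv" is taken here to mean the percentage of records that were emitted with
an abbreviation.

```python
import json
from collections import defaultdict


def load_reports(paths):
    """Read one JSON report per translation unit (hypothetical layout)."""
    reports = []
    for path in paths:
        with open(path) as handle:
            reports.append(json.load(handle))
    return reports


def summarize(reports):
    """Sum per-record counts and bit sizes over all per-TU reports."""
    totals = defaultdict(lambda: {"count": 0, "bits": 0, "abbrev_count": 0})
    for report in reports:
        for name, rec in report.items():
            entry = totals[name]
            entry["count"] += rec["count"]
            entry["bits"] += rec["bits"]
            entry["abbrev_count"] += rec["abbrev_count"]

    summary = {}
    for name, entry in totals.items():
        count = entry["count"]
        summary[name] = {
            "count": count,
            "bits": entry["bits"],
            # Percentage of records written with an abbreviation (assumed meaning).
            "abv": 100.0 * entry["abbrev_count"] / count if count else 0.0,
        }
    return summary


def largest(summary, n=4):
    """Return the n record kinds that occupy the most bits."""
    ranked = sorted(summary.items(), key=lambda kv: kv[1]["bits"], reverse=True)
    return ranked[:n]
```

Sorting by total bits and then looking at "abv" is what makes large,
unabbreviated record kinds stand out.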

>
> Some obvious targets for adding abbrevs (>1GB and completely
> unabbreviated):
>
>                 "DECL_CXX_METHOD": {
>                     "count": 54583363,
>                     "bits": "4.49 GB",
>                     "abv": 0.0
>                 },
>
>                 "DECL_CXX_CONSTRUCTOR": {
>                     "count": 17594183,
>                     "bits": "1.47 GB",
>                     "abv": 0.0
>                 },
>
>                 "DECL_CXX_RECORD": {
>                     "count": 24180665,
>                     "bits": "1.1 GB",
>                     "abv": 0.0
>                 },
>
>                 "DECL_CLASS_TEMPLATE_SPECIALIZATION": {
>                     "count": 17971702,
>                     "bits": "1.77 GB",
>                     "abv": 0.0
>                 },
>
> A couple of other things I've been planning to do to improve AST file size
> (but haven't got around to yet):
>
> * We should allow each Decl kind to specify a list of abbreviations (from
> most specific to most general) and use the first one that fits the data. We
> should always use *some* abbreviation for every Decl, even if we only get
> to abbreviate the base Decl fields and use an array of VBR6 for the rest.
>
> * We store SourceLocations as absolute offsets right now, wasting a lot of
> bits on redundant information. Instead, we should store SourceLocations as
> a delta from the root location of the record we're reading (the location of
> the Decl or the start of the Stmt tree).
>
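
To get a feel for what the SourceLocation delta idea could save, here is a
small back-of-the-envelope sketch. It assumes each location is written as a
VBR6 field (5 payload bits plus 1 continuation bit per chunk, as in the LLVM
bitstream format) and that signed deltas are folded into unsigned values by
putting the sign in the low bit. The helper names and the sample offsets are
hypothetical; this does not reflect how Clang's writer actually emits
locations.

```python
def vbr_bits(value, width=6):
    """Bits needed to emit a non-negative integer as VBR-`width`."""
    if value < 0:
        raise ValueError("VBR encodes unsigned values")
    payload = width - 1  # one bit per chunk is the continuation flag
    chunks = max(1, -(-value.bit_length() // payload))  # ceiling division
    return chunks * width


def to_unsigned(delta):
    """Fold the sign into the low bit so small negative deltas stay small."""
    return (delta << 1) if delta >= 0 else ((-delta) << 1) | 1


def cost(root, locations, width=6):
    """Return (bits for absolute offsets, bits for deltas from `root`)."""
    absolute = sum(vbr_bits(loc, width) for loc in locations)
    relative = sum(vbr_bits(to_unsigned(loc - root), width) for loc in locations)
    return absolute, relative


# Made-up example: three locations close to a root offset of 1,000,000.
absolute_bits, delta_bits = cost(1_000_000, [1_000_000, 1_000_010, 999_990])
print(absolute_bits, delta_bits)  # 72 18
```

With these sample values each absolute offset needs four VBR6 chunks, while
each delta fits in one, so the locations shrink from 72 to 18 bits. Real
savings would depend on how far locations in a record sit from its root.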
-------------- next part --------------
A non-text attachment was scrubbed...
Name: summarizeAsts.py
Type: text/x-python
Size: 2815 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20161024/d4809d57/attachment.py>