<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On 21 October 2016 at 21:26, Richard Smith via cfe-dev <span dir="ltr"><<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="gmail-">On Thu, Oct 20, 2016 at 2:23 AM, Ilya Palachev via cfe-dev <span dir="ltr"><<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

It seems that compressing AST files with simple "gzip --fast" makes them 30-40% smaller.<br>

So the questions are:<br>

 1. Is the current AST serialization format really uncompressed (only abbreviations in the bitstream format)?<br>

 2. Is it worthwhile to compress ASTs by default (with -emit-ast)?<br>

 3. Will this break things like PCH?<br>

 4. What's the current trade-off between PCH compile time and disk usage? If AST compression makes compilation a bit slower but reduces disk usage significantly, would that be acceptable to users?<br>

<br>

LLVM already has support for compression (the functions compress/uncompress in include/llvm/Support/Compression.h).</blockquote><div><br></div></span><div>The current AST format is designed for lazy, partial loading from disk; we make heavy use of file offsets to pull in only the small portions of AST files that are actually used. In a compilation using hundreds or thousands of AST files, it's essential that we don't load any more than we need to (just the file headers) since we should need essentially nothing from almost all loaded files.</div><div><br></div><div>Any approach that requires the entire file to be decompressed seems like a non-starter. I would expect you could get something like the 30-40% improvements you're seeing under gzip by making better use of abbreviations and using smarter representations generally. There is some easy low-hanging fruit here.</div></div></div></div></blockquote><div><br></div><div>I agree; I also saw some low-hanging fruit in the serialized AST format. I took one measurement to see which parts of the ASTs contribute the most to the size of the AST dumps. For details, see the JSON attached to this mail: <a href="http://clang-developers.42468.n3.nabble.com/Two-pass-analysis-framework-AST-merging-approach-tp4051301p4052577.html">http://clang-developers.42468.n3.nabble.com/Two-pass-analysis-framework-AST-merging-approach-tp4051301p4052577.html</a><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


<br></blockquote></div><br></div></div>