[PATCH] D12715: Reduce PGO Instrumentation binary and profile data size (Patch-1)

Wed Sep 9 16:29:41 PDT 2015

On Wed, Sep 9, 2015 at 1:16 AM, Sean Silva <chisophugis at gmail.com> wrote:
>
>
> On Tue, Sep 8, 2015 at 10:34 PM, Xinliang David Li <davidxl at google.com>
> wrote:
>>
>> On Tue, Sep 8, 2015 at 10:00 PM, Sean Silva <chisophugis at gmail.com> wrote:
>> > silvas added a comment.
>> >
>> > Fundamentally, this approach is not a long-term solution to reducing PGO
>> > data size. The correct long-term solution for reducing PGO data size
>> > requires passing the binary to llvm-profdata.
>>
>> This basically goes back to my original proposed solution.  Whether
>> this solution is longer term or not depends on whether it is
>> sufficiently effective in reducing size -- the data we have now
>> confirm that it is indeed effective.
>
>
> If we have the binary, the size of the names should not matter because they
> should not be present in the runtime memory image or in the raw profile
> data.

If we can tolerate the inconvenience of passing binary to
llvm-profdata, I think there is a very simple and nice way of doing
this.

Currently LLVM emits symbols like __llvm_profile_name_foo with
contents 'foo', this is basically a waste -- the string of 'foo' is
actually a substring of the symbol name itself.

In other words, we can choose to continue emit __llvm_profile_name_xxx
symbols as before, but with zero size.  llvm-profdata/llvm-cov tool
can read the symtab and reconstruct the md5hash values for them.

With this change,

1) no new option is needed
2) binary size, raw/indexed profile size will be minimized
consistently (even when coverage mapping is on).

>Do you know how much improvement there would be over the current patch
> if we could completely eliminate the names from the memory image and from
> the raw profile?

There are 16 byte overhead per function for emitting the name. For
clang, the overhead of the additional rodata is about 6% of the total
text size.

Note that if we don't use strings (name or md5string) but just 64bit
md5hash as the key (as done by my original patch), the perf-function
overhead for the keys  is zero -- because the Name reference field in
llvm_profile_data_* structure can be reused.

> I.e., how large are the just the counters by themselves
> compared to the current raw profile data? Quickly looking at some source
> code (but not testing anything) suggests that there is one `struct
> __llvm_profile_data` per counter. Depending on the pointer size, this struct
> is 3-4x as large as the counter itself.

_llvm_profile_data_*** is  one per function. The size of the
per-function data is 32 bytes (64bit arch), so __llvm_prf_data section
is larger than __llvm_prf_name.  So the largest is the counter section
when this patch is applied. Without this patch, the largest is
__llvm_prf_name (way larger).

>
> One other thing that bothers me is that (while I don't use llvm-cov) this
> size reduction in this patch is incompatible with llvm-cov.

yes -- but they can be made compatible if we are ok with more drastic change.

>
> My thinking is that if we want to provide a feature for reducing the
> overhead in the runtime memory image or in the profraw, it should:
> - be fully general (e.g. cover the llvm-cov use case)
> - have nearly optimal size (3-4x overhead is not acceptable)
> - be completely transparent to the user, except for the fact that the binary
> must be passed to llvm-profdata (but not to clang; llvm-profdata should
> extract all necessary information from the binary)
>
> I think this is achievable and not hugely more complicated than the current
> patch.
>

agreed (other than the 3-4x overhead statement above).

> When it comes to adding a feature to clang that we must maintain forever, it
> makes more sense to have it be something like
> `-fprofile-instr-reduced-image-size` which can have a clear contract and the
> implementation of which we can continually improve (or make optimal). For
> this, the transparency is important (e.g. output of llvm-profdata should be
> the same). The current patch is highly tied to the precise implementation of
> what we are doing to reduce the size, and does not leave margin for
> improvement or for changing the underlying approach to cover new use cases
> (see below).
>
> What do you think?

See above, we can even do without the option and make every use case happy.

>
> Out of curiosity, could you post the patch for your original approach that
> required the binary for llvm-profdata? Could it potentially work as an
> initial approach to "`-fprofile-instr-reduced-image-size`" that we could
> incrementally build on?
>

I have not added the llvm-profdata debug dump support yet.  If we
reach an agreement that is a good thing to do (which I agree), I will
add that submit that patch instead.

>
> The idea is that if a user says that there is too much overhead in the
> runtime memory image or in the profraw, then we can tell them to use a
> particular option to improve the situation, with the only caveat being that
> they must pass the binary to llvm-profdata;

I don't have supporting data, but if majority of users do not need
debug symbol info for llvm-profdata (that often), would it be better
to require passing binary if per function dump is needed? Post
proposing script can also be easily developed for that. It also makes
the use model consistent with llvm-cov.

>besides that, there are no
> strings attached, and no hidden issues for the user (e.g. in an embedded
> project, the code may be mostly C and the function names may typically be
> shorter than the hash, so this patch could make things worse). The current
> patch assumes that function name size is the primary overhead which makes
> PGO not usable; do we don't have a reason to believe that to be the case in
> general? I am interested to hear from the other commenter on the thread who
> mentioned issues with pgo size; still our sample set is very small.

For C programs, using name hash string instead of name itself may
actually increase overhead -- it is probably a good reason that the
zero overhead approach is preferred.  Here are some data from SPEC
(average name size):

GCC (C): 20
SJENG (C): 12
GZIP (C): 11
BZIP2 (C): 15
MCF (C) : 15
LIBQUANTUM (C): 17
HMMER (C) : 13
GOBMK (C): 20

Xalan(C++) : 61
DealII (C++): 58
Clang (C++): 82

Google benchmarks
>
> It seems premature to think that function name size is the greatest overhead
> in general since there are only O(#functions) of them, while there are
> O(#counters) unnecessary instances of `struct __llvm_profile_data` (and that
> struct is reasonably large compared to typical identifiers, even mangled)

llvm_profile_data is also O(#functions).

. A
> quick exploration in Mathematica suggests that for Clang, the bulk of the
> size overhead of function names is concentrated in the range 50-150
> characters (see http://i.imgur.com/en2mpBF.png; area under the curve of the
> first plot is total size of identifiers, with horizontal axis being bins by
> identifier size; second plot is left to right accumulated values of the
> first plot). Since `struct __llvm_profile_data` is ~25 bytes, this means
> that we need only assume that functions typically have 2-6 counters for the
> overhead of the structs to be comparable to the mangled names.

llvm_prof_data is 32byte and it is per-function.

ave negligible effect
> on the size as a whole.
>
>>
>>
>> >The approach of using name hashes is also not an incremental step towards
>> > the long-term solution. So thinking about it a bit more, I'm on the fence
>> > about whether a quick solution like this one even makes sense to do.
>>
>> I am on the fence too, but leaning towards having this one as the
>> longer term solution -- it is cleaner and less disruptive. It does not
>> require bumping up the version numbers. (On the other hand, my
>> original patch does require version change)
>>
>> > I am very apprehensive of adding complexity to make this approach work,
>> > in particular threading information through many layers, as this is likely
>> > to get in the way of future work towards the long-term solution.
>>
>> I don't see the complexity actually. Other than some refactoring, the
>> code here is very non-intrusive. I don't think it will get in the way
>> of 'longer term' solution if it ever became a need to do.
>
>
> My complexity-detector must be tuned differently than yours.
>
> For example, -instr-prof-with-namehash introduces a weird coupling between
> the frontend and the middle-end. It seems to just be used for setting
> IncludeNamesInProfile, which seems to be completely orthogonal to hashing?

This option is only useful for testing purpose (with opt).

Anyway, I am fine with going forward with the most aggressive size
reduction approach that every one can benefit consistently.

David

>
>>
>>
>> >
>> > My takeaways from the RFC thread (maybe I am misunderstanding) is that
>> > we are still using the same format just with one string vs. another.
>>
>> yes that is true -- that is why version number is not bumped up.
>>
>> > We shouldn't need to rev the version and there shouldn't need to be any
>> > changes in compiler-rt or llvm.
>>
>> Note that version is not changed. It is just the version field is now
>> partitioned to allow more information to be passed around. This is a
>> very useful feature to have.
>>
>>
>> >Clang can easily diagnose (as an isolated piece of logic) if the user did
>> > the pgo training run with md5 mode enabled but the md5 mode flag was not
>> > passed or vice versa (it is easy to tell if input function names are md5
>> > sums or not as a diagnostic heuristic).
>>
>> If we want to enforce matching options in profile-gen and profile-use,
>> the mismatched option and profile data should result in hard errors --
>> current implementation allows it to be caught as a hard error (with
>> encoding in the profile data).
>
>
> I don't see in the patch where this error is issued (nor where that error is
> tested).
>
>>
>>
>> >It is also inconsistent for llvm-profdata to be able to be passed
>> > `-function=foo` for an md5 profile; the user should have to pass (truncated)
>> > md5 of the function name.
>>
>> This is usability and convenience -- not inconsistency.
>
>
> Fundamentally, trying to paper over the fact that what is in the file is md5
> sums is wrong. There are many ways that it sneaks through with this
> approach, and I don't see a fundamental reason to try to add a special case
> for this one option. If this is needed for convenience, a separate patch can
> add an option `-md5-function=` which automatically takes the md5 hash of the
> provided argument to save the trouble.
>
> -- Sean Silva
>
>>
>>
>> >
>> >
>> > ================
>> > Comment at:
>> > test/Instrumentation/InstrProfiling/profiling_md5hash_key.ll:5-6
>> > @@ +4,4 @@
>> > +
>> > + at __llvm_profile_name_foo = hidden constant [16 x i8]
>> > c"5cf8c24cdb18bdac"
>> > +; CHECK: @__llvm_profile_name_foo = hidden constant [16 x i8]
>> > c"5cf8c24cdb18bdac", section "__DATA,__llvm_prf_names", align 1
>> > + at __llvm_profile_name_bar = hidden constant [17 x i8]
>> > c"e413754a191db537\00"
>> > ----------------
>> > This is backward from a standard md5 sum in hex. We should maintain the
>> > invariant that the resulting symbol name is a substring of a standard md5
>> > sum. Otherwise it is very difficult to explain how to get the hash for a
>> > function.
>> >
>> > E.g. we should be able to say "the symbol name is effectively replaced
>> > by the first half of its md5 sum in hex":
>>
>>
>> Good catch. That can be fixed.
>>
>> thanks,
>>
>> David
>>
>> >
>> > ```
>> > Sean:~ % echo -n 'foo' | md5
>> > acbd18db4cc2f85cedef654fccc4a4d8
>> > Sean:~ % echo -n 'foo' | md5 | head -c 16
>> > acbd18db4cc2f85c
>> > ```
>> >
>> > The symbol name string should be `acbd18db4cc2f85c`.
>> >
>> >
>> > http://reviews.llvm.org/D12715
>> >
>> >
>> >
>
>