[llvm-dev] RFC: Reducing Instr PGO size overhead

Tue Sep 8 11:40:14 PDT 2015

Xinliang David Li <davidxl at google.com> writes:
>>> The key type before the change is StringRef, while the after the key
>>> type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
>>> a string of 8 bytes or storing md5 has in text form which will double
>>> the size?
>>
>>
>> How much does this change the benefit? If most of the benefit is avoiding
>> extraordinarily long mangled names then it may be sufficient.
>
> I implemented the same functionality using the approach suggested by you.
>
> 1) There is no format change required for raw/indexed profile data
> 2) when a user option is specified, md5hash string is used as the
> function key instead of of the raw function name
>
> In addition to that,  the option specified is also passed to the
> runtime and emitted into profile data. This allows different variants
> of the profile data encoded in the same format  to be automatically
> recognized by the client (clang -fprofile-use, llvm-profdata etc)
> without the need to specify the option again.
>
> There are some size increase, but increase seems to be in the range
> that is acceptable. Using clang as the test case. The indexed profile
> size is now 33.6 M (compared with 29.5M which is the optimal case).
> The raw profile size is 54.1M (compared with 40.1M from previous
> patch).

Have any of these patches you're referencing been posted at all? I came
to the thread late and it's a bit hard to follow which is which.

In any case, the idea sounds fine to me. Names in the profile can be
optional assuming we can safely differentiate the functions without them
(aside: on-by-default seems more user friendly than off-by-default, but
I don't have a strong opinion. Sean?).

I have a couple of high level thoughts on the approach:

Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> writes:
> Looking at various sources of size increase caused by instrumentation,
> it is clear that the biggest contributor is the __llvm_prf_names
> section. The current PGO implementation uses function  assembly names
> as the lookup keys in the indexed format so it needs to be emitted in
> both rodata of the binary and in the raw/indexed profiles.

There's also the matter that we mangle the filename onto the symbol name
for non-ODR function names, which makes the names even larger than they
need to be. I imagine this isn't the biggest contributor (C++ mangled
names tend to be longer than filenames), but it is relatively easy to
fix independently of this change, by incorporating the filename into the
function's structural hash instead.

> On the other hand, LLVM's indexed format also supports 'key collision'
> -- it allows functions with the same key share the same entry in the
> hash table. Function's structural hash values will be used as a
> secondary key to find a real match.
>
> The above feature allows us to use a different key other than function
> assembly names and this can reduce binary/profile data size
> significantly.  Function name's MD5 hash is a good candidate, and I
> have a patch (3 parts: runtime, reader/writer, clang) that does just
> that. The results are very promising. With this change, the clang
> instrumented binary size is now 216M (down from 280M); the raw profile
> size is only 40.1M (a 2.85X size reduction); and the indexed profile
> size is only 29.5M (a 2.2X size reduction).

It should be fairly unlikely, but we should probably consider hash
collisions briefly. If we hit a collision in function names that also
collides in the structural hash, there is no way to recover in this
scheme. So:

1. The effect is that we get bogus data in each of the affected
   functions (the sum of the counters in each set, essentially).

2. To hit this, we need two functions with basically the same control
   flow, that have names that collide. Functions having the same control
   flow is obviously valid and likely, so really we only care about MD5
   collisions on function names. Are C++ mangled names long enough that
   collisions are likely?

As long as we're comfortable that (1) isn't so bad and (2) is
sufficiently rare this should be fine. Otherwise, we should consider a
better hash - though that would probably mean changing from 64 bits to
128 and reducing the size less.

Regardless, we can probably take this to -commits and start discussing
the patches themselves. I don't think anyone's opposed to the general
idea, as long as you don't break the use case that Sean brought up.