[PATCH] D12715: Reduce PGO Instrumentation binary and profile data size (Patch-1)

Thu Sep 10 20:12:10 PDT 2015

>>
>> raw counter itself is not enough. All implementation needs a per-function
>> control data __llvm_profile_data_<func_name> that stores the key or
>> reference to key, structure/cfg hash, and number of counts in the counter
>> variable. It also has a pointer/reference to the counter variable itself. To
>> save 8 byte per function, we can choose to embed the counter variable into
>> llvm_profile_data_xx variable to make it variable length, but that is an
>> orthogonal change.
>
>
> I was talking from a purely theoretical standpoint. If all counters are in a
> single section then their order in the section implicitly identifies them,
> so there does not need to be any extra overhead in the program's memory
> image at runtime. (more generally, think like symbolicating a call stack
> without having debug info present in the program image). The association
> between the counter and all the auxiliary data can happen offline when
> llvm-profdata looks at the raw profile in conjunction with the binary.

True. This means offline data is needed for generating indexed format.
Enforcing the same order (and debugging the problems) can be tricky.

>
>>
>>
>> > 2. size overhead in the raw profile. Ideal case: just the raw counter
>> > values.
>> > 3. size overhead in the indexed profile. No clear ideal case.
>> > 4. raw binary size of the program's ELF file. No clear ideal case.
>> > (did I miss any?)
>>
>> I would split 4 into
>> 4.1 unstripped binary size
>> 4.2 stripped binary size
>>
>
> Good point.
>
>>
>>
>> Let's compare three cases
>>
>> A) The existing implementation
>> B) The implementation using 64bit md5 integer as func-key (my originally
>> proposed approach) -- which is the ideal solution I mentioned above
>> C) The implementation using 16 byte md5 hash str as func key -- which is
>> what this patch implements.
>>
>> The following summarizes the per-function overhead (assuming average func
>> name size is 82.
>>
>>
>>        (1)     (2)     (3)       (4.1)          (4.2)
>> ------------------------------------------------------------
>>  A      82      82      82     4*82+3*40+59        82
>>  B      0       0       0      3*82+3*40+59        0
>>  C      16      16      16     16+3*82+3*40+59     16
>>
>> A little explanation of the size overhead of 4.1. It includes symbol names
>> for __llvm_profile_data_xxx, __llvm_profile_counters_xxx, and
>> __llvm_profile_name_xxx. The value 59 comes from the name prefixes. It also
>> assumes Elf64 symtab entry overhead of 40 bytes.
>>
>> We can see B) is the clear winner in all categories.  There are other
>> techniques to reduce overhead in 4.1
>> a) more aggressively making those symbols with private linkage so that
>> linker can merge them. (with B, we do need to keep __llvm_profile_name_xxx
>> symbol non-local, but other symbols can be privatized)
>> b) embed llvm_profile_counter into llvm_profile_data so that we don't need
>> to create symbols for them. This also helps all other categories ( negative
>> 8 byte overhead per func)
>> c) make prefixes of those symbols shorter.
>>
>> Unstripped binary size is probably not too interesting to optimize for,
>> but it may cause linker to reach limit (especially when -g is also
>> specified).
>>
>> >
>> > For me, only 1. is of primary importance. 2. and 3. are not relevant for
>> > programs of the size I have (~ size of Clang) and my users' setups (an
>> > entire host machine is available to handle the data, and only a single
>> > profile is written at a given time). 4. is irrelevant since large
>> > binaries
>> > must already be handled anyway for debug info.
>> >
>> > For an embedded use case 1. and 2. are important but 3,4 are irrelevant
>> > since that happens on the host (assume host has much more resources than
>> > target).
>>
>> 4) can prevent the binary from being installable on the device due to the
>> size limit.
>
>
> 4.2, but not 4.1. In general, if we are to assume that a binary needs to be
> passed to llvm-profdata, we should require the unstripped binary, since that
> gives us the most information; we can later relax the requirement if
> necessary while remaining compatible.
>
>>
>>
>> >
>> > It sounds like for your use case, 2,3,4 are the primary concerns. Is
>> > this
>> > correct?
>>
>> All of them (we care about large servers and small devices).
>
>
> How small?

Without going too much details, lets say something like system
partition size is only a couple hundred MB etc ..

David

>
> -- Sean Silva
>
>>
>>
>> thanks,
>>
>> David
>>
>> >
>> > -- Sean Silva
>> >
>> >
>> >>
>> >>
>> >> >Do you know how much improvement there would be over the current patch
>> >> > if we could completely eliminate the names from the memory image and
>> >> > from
>> >> > the raw profile?
>> >>
>> >> There are 16 byte overhead per function for emitting the name. For
>> >> clang, the overhead of the additional rodata is about 6% of the total
>> >> text size.
>> >>
>> >> Note that if we don't use strings (name or md5string) but just 64bit
>> >> md5hash as the key (as done by my original patch), the perf-function
>> >> overhead for the keys  is zero -- because the Name reference field in
>> >> llvm_profile_data_* structure can be reused.
>> >>
>> >> > I.e., how large are the just the counters by themselves
>> >> > compared to the current raw profile data? Quickly looking at some
>> >> > source
>> >> > code (but not testing anything) suggests that there is one `struct
>> >> > __llvm_profile_data` per counter. Depending on the pointer size, this
>> >> > struct
>> >> > is 3-4x as large as the counter itself.
>> >>
>> >> _llvm_profile_data_*** is  one per function. The size of the
>> >> per-function data is 32 bytes (64bit arch), so __llvm_prf_data section
>> >> is larger than __llvm_prf_name.  So the largest is the counter section
>> >> when this patch is applied. Without this patch, the largest is
>> >> __llvm_prf_name (way larger).
>> >>
>> >>
>> >> >
>> >> > One other thing that bothers me is that (while I don't use llvm-cov)
>> >> > this
>> >> > size reduction in this patch is incompatible with llvm-cov.
>> >>
>> >> yes -- but they can be made compatible if we are ok with more drastic
>> >> change.
>> >>
>> >> >
>> >> > My thinking is that if we want to provide a feature for reducing the
>> >> > overhead in the runtime memory image or in the profraw, it should:
>> >> > - be fully general (e.g. cover the llvm-cov use case)
>> >> > - have nearly optimal size (3-4x overhead is not acceptable)
>> >> > - be completely transparent to the user, except for the fact that the
>> >> > binary
>> >> > must be passed to llvm-profdata (but not to clang; llvm-profdata
>> >> > should
>> >> > extract all necessary information from the binary)
>> >> >
>> >> > I think this is achievable and not hugely more complicated than the
>> >> > current
>> >> > patch.
>> >> >
>> >>
>> >> agreed (other than the 3-4x overhead statement above).
>> >>
>> >> > When it comes to adding a feature to clang that we must maintain
>> >> > forever, it
>> >> > makes more sense to have it be something like
>> >> > `-fprofile-instr-reduced-image-size` which can have a clear contract
>> >> > and
>> >> > the
>> >> > implementation of which we can continually improve (or make optimal).
>> >> > For
>> >> > this, the transparency is important (e.g. output of llvm-profdata
>> >> > should
>> >> > be
>> >> > the same). The current patch is highly tied to the precise
>> >> > implementation of
>> >> > what we are doing to reduce the size, and does not leave margin for
>> >> > improvement or for changing the underlying approach to cover new use
>> >> > cases
>> >> > (see below).
>> >> >
>> >> > What do you think?
>> >>
>> >> See above, we can even do without the option and make every use case
>> >> happy.
>> >>
>> >> >
>> >> > Out of curiosity, could you post the patch for your original approach
>> >> > that
>> >> > required the binary for llvm-profdata? Could it potentially work as
>> >> > an
>> >> > initial approach to "`-fprofile-instr-reduced-image-size`" that we
>> >> > could
>> >> > incrementally build on?
>> >> >
>> >>
>> >> I have not added the llvm-profdata debug dump support yet.  If we
>> >> reach an agreement that is a good thing to do (which I agree), I will
>> >> add that submit that patch instead.
>> >>
>> >> >
>> >> > The idea is that if a user says that there is too much overhead in
>> >> > the
>> >> > runtime memory image or in the profraw, then we can tell them to use
>> >> > a
>> >> > particular option to improve the situation, with the only caveat
>> >> > being
>> >> > that
>> >> > they must pass the binary to llvm-profdata;
>> >>
>> >> I don't have supporting data, but if majority of users do not need
>> >> debug symbol info for llvm-profdata (that often), would it be better
>> >> to require passing binary if per function dump is needed? Post
>> >> proposing script can also be easily developed for that. It also makes
>> >> the use model consistent with llvm-cov.
>> >>
>> >> >besides that, there are no
>> >> > strings attached, and no hidden issues for the user (e.g. in an
>> >> > embedded
>> >> > project, the code may be mostly C and the function names may
>> >> > typically
>> >> > be
>> >> > shorter than the hash, so this patch could make things worse). The
>> >> > current
>> >> > patch assumes that function name size is the primary overhead which
>> >> > makes
>> >> > PGO not usable; do we don't have a reason to believe that to be the
>> >> > case
>> >> > in
>> >> > general? I am interested to hear from the other commenter on the
>> >> > thread
>> >> > who
>> >> > mentioned issues with pgo size; still our sample set is very small.
>> >>
>> >> For C programs, using name hash string instead of name itself may
>> >> actually increase overhead -- it is probably a good reason that the
>> >> zero overhead approach is preferred.  Here are some data from SPEC
>> >> (average name size):
>> >>
>> >> GCC (C): 20
>> >> SJENG (C): 12
>> >> GZIP (C): 11
>> >> BZIP2 (C): 15
>> >> MCF (C) : 15
>> >> LIBQUANTUM (C): 17
>> >> HMMER (C) : 13
>> >> GOBMK (C): 20
>> >>
>> >> Xalan(C++) : 61
>> >> DealII (C++): 58
>> >> Clang (C++): 82
>> >>
>> >> Google benchmarks
>> >> >
>> >> > It seems premature to think that function name size is the greatest
>> >> > overhead
>> >> > in general since there are only O(#functions) of them, while there
>> >> > are
>> >> > O(#counters) unnecessary instances of `struct __llvm_profile_data`
>> >> > (and
>> >> > that
>> >> > struct is reasonably large compared to typical identifiers, even
>> >> > mangled)
>> >>
>> >> llvm_profile_data is also O(#functions).
>> >>
>> >> . A
>> >> > quick exploration in Mathematica suggests that for Clang, the bulk of
>> >> > the
>> >> > size overhead of function names is concentrated in the range 50-150
>> >> > characters (see http://i.imgur.com/en2mpBF.png; area under the curve
>> >> > of
>> >> > the
>> >> > first plot is total size of identifiers, with horizontal axis being
>> >> > bins
>> >> > by
>> >> > identifier size; second plot is left to right accumulated values of
>> >> > the
>> >> > first plot). Since `struct __llvm_profile_data` is ~25 bytes, this
>> >> > means
>> >> > that we need only assume that functions typically have 2-6 counters
>> >> > for
>> >> > the
>> >> > overhead of the structs to be comparable to the mangled names.
>> >>
>> >> llvm_prof_data is 32byte and it is per-function.
>> >>
>> >> ave negligible effect
>> >> > on the size as a whole.
>> >> >
>> >> >>
>> >> >>
>> >> >> >The approach of using name hashes is also not an incremental step
>> >> >> > towards
>> >> >> > the long-term solution. So thinking about it a bit more, I'm on
>> >> >> > the
>> >> >> > fence
>> >> >> > about whether a quick solution like this one even makes sense to
>> >> >> > do.
>> >> >>
>> >> >> I am on the fence too, but leaning towards having this one as the
>> >> >> longer term solution -- it is cleaner and less disruptive. It does
>> >> >> not
>> >> >> require bumping up the version numbers. (On the other hand, my
>> >> >> original patch does require version change)
>> >> >>
>> >> >> > I am very apprehensive of adding complexity to make this approach
>> >> >> > work,
>> >> >> > in particular threading information through many layers, as this
>> >> >> > is
>> >> >> > likely
>> >> >> > to get in the way of future work towards the long-term solution.
>> >> >>
>> >> >> I don't see the complexity actually. Other than some refactoring,
>> >> >> the
>> >> >> code here is very non-intrusive. I don't think it will get in the
>> >> >> way
>> >> >> of 'longer term' solution if it ever became a need to do.
>> >> >
>> >> >
>> >> > My complexity-detector must be tuned differently than yours.
>> >> >
>> >> > For example, -instr-prof-with-namehash introduces a weird coupling
>> >> > between
>> >> > the frontend and the middle-end. It seems to just be used for setting
>> >> > IncludeNamesInProfile, which seems to be completely orthogonal to
>> >> > hashing?
>> >>
>> >> This option is only useful for testing purpose (with opt).
>> >>
>> >> Anyway, I am fine with going forward with the most aggressive size
>> >> reduction approach that every one can benefit consistently.
>> >>
>> >> David
>> >>
>> >>
>> >>
>> >> >
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > My takeaways from the RFC thread (maybe I am misunderstanding) is
>> >> >> > that
>> >> >> > we are still using the same format just with one string vs.
>> >> >> > another.
>> >> >>
>> >> >> yes that is true -- that is why version number is not bumped up.
>> >> >>
>> >> >> > We shouldn't need to rev the version and there shouldn't need to
>> >> >> > be
>> >> >> > any
>> >> >> > changes in compiler-rt or llvm.
>> >> >>
>> >> >> Note that version is not changed. It is just the version field is
>> >> >> now
>> >> >> partitioned to allow more information to be passed around. This is a
>> >> >> very useful feature to have.
>> >> >>
>> >> >>
>> >> >> >Clang can easily diagnose (as an isolated piece of logic) if the
>> >> >> > user
>> >> >> > did
>> >> >> > the pgo training run with md5 mode enabled but the md5 mode flag
>> >> >> > was
>> >> >> > not
>> >> >> > passed or vice versa (it is easy to tell if input function names
>> >> >> > are
>> >> >> > md5
>> >> >> > sums or not as a diagnostic heuristic).
>> >> >>
>> >> >> If we want to enforce matching options in profile-gen and
>> >> >> profile-use,
>> >> >> the mismatched option and profile data should result in hard errors
>> >> >> --
>> >> >> current implementation allows it to be caught as a hard error (with
>> >> >> encoding in the profile data).
>> >> >
>> >> >
>> >> > I don't see in the patch where this error is issued (nor where that
>> >> > error is
>> >> > tested).
>> >> >
>> >> >>
>> >> >>
>> >> >> >It is also inconsistent for llvm-profdata to be able to be passed
>> >> >> > `-function=foo` for an md5 profile; the user should have to pass
>> >> >> > (truncated)
>> >> >> > md5 of the function name.
>> >> >>
>> >> >> This is usability and convenience -- not inconsistency.
>> >> >
>> >> >
>> >> > Fundamentally, trying to paper over the fact that what is in the file
>> >> > is
>> >> > md5
>> >> > sums is wrong. There are many ways that it sneaks through with this
>> >> > approach, and I don't see a fundamental reason to try to add a
>> >> > special
>> >> > case
>> >> > for this one option. If this is needed for convenience, a separate
>> >> > patch
>> >> > can
>> >> > add an option `-md5-function=` which automatically takes the md5 hash
>> >> > of
>> >> > the
>> >> > provided argument to save the trouble.
>> >> >
>> >> > -- Sean Silva
>> >> >
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >
>> >> >> > ================
>> >> >> > Comment at:
>> >> >> > test/Instrumentation/InstrProfiling/profiling_md5hash_key.ll:5-6
>> >> >> > @@ +4,4 @@
>> >> >> > +
>> >> >> > + at __llvm_profile_name_foo = hidden constant [16 x i8]
>> >> >> > c"5cf8c24cdb18bdac"
>> >> >> > +; CHECK: @__llvm_profile_name_foo = hidden constant [16 x i8]
>> >> >> > c"5cf8c24cdb18bdac", section "__DATA,__llvm_prf_names", align 1
>> >> >> > + at __llvm_profile_name_bar = hidden constant [17 x i8]
>> >> >> > c"e413754a191db537\00"
>> >> >> > ----------------
>> >> >> > This is backward from a standard md5 sum in hex. We should
>> >> >> > maintain
>> >> >> > the
>> >> >> > invariant that the resulting symbol name is a substring of a
>> >> >> > standard
>> >> >> > md5
>> >> >> > sum. Otherwise it is very difficult to explain how to get the hash
>> >> >> > for a
>> >> >> > function.
>> >> >> >
>> >> >> > E.g. we should be able to say "the symbol name is effectively
>> >> >> > replaced
>> >> >> > by the first half of its md5 sum in hex":
>> >> >>
>> >> >>
>> >> >> Good catch. That can be fixed.
>> >> >>
>> >> >> thanks,
>> >> >>
>> >> >> David
>> >> >>
>> >> >> >
>> >> >> > ```
>> >> >> > Sean:~ % echo -n 'foo' | md5
>> >> >> > acbd18db4cc2f85cedef654fccc4a4d8
>> >> >> > Sean:~ % echo -n 'foo' | md5 | head -c 16
>> >> >> > acbd18db4cc2f85c
>> >> >> > ```
>> >> >> >
>> >> >> > The symbol name string should be `acbd18db4cc2f85c`.
>> >> >> >
>> >> >> >
>> >> >> > http://reviews.llvm.org/D12715
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>>
>