[llvm-dev] RFC: Reducing Instr PGO size overhead

Fri Oct 9 16:06:46 PDT 2015

On Fri, Oct 9, 2015 at 3:58 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>
> On Wed, Oct 7, 2015 at 11:12 PM, Xinliang David Li <davidxl at google.com>
> wrote:
>>
>> There is no further response to this, so I will assume general
>> direction of solution-3 is acceptable ;)

> No response does not mean "LGTM".
>

What I meant is that the discussion can be moved on to the formal code
review. I have not yet submitted the final patch for review yet.
Before that is ready, continue using this thread to voice your
concerns.

David

> -- Sean Silva
>
>>
>>
>> Solution-3 can be further improved. Instead of using static symbol
>> table (with zero size __llvm_prf_nm symbols) to store function names
>> for profile display and coverage mapping, the function names can be
>> stored compressed in a non-allocatable section. The compression ratio
>> for function name strings can be very high (~5x).  The covmapping data
>> can also be made non-allocatable.
>>
>> thanks,
>>
>> David
>>
>> On Tue, Sep 29, 2015 at 9:52 AM, Xinliang David Li <davidxl at google.com>
>> wrote:
>> > Sorry for the late update. I finally found time to implement a solution
>> > (Solution-3) that has the best size savings (for both PGO and coverage
>> > testing) with symbolic information available.  Here is a re-cap of what
>> > we
>> > have discussed so far:
>> >
>> > Solution-1:
>> >
>> > This is the original solution proposed. In this solution, A function
>> > name's
>> > MD5 hash is used as the primary key (combined with IR hash) for function
>> > lookup. __llvm_prf_names section won't be emitted into the binary nor
>> > dumped
>> > into the profile data unless -fcoverage-mapping is also specified.
>> >
>> >     Pros:
>> >         1. maximal reduction of instrumentation binary process image
>> > size
>> >         2. maximal reduction of object and unstripped binary size
>> >         3. maximal reduction of raw profile data size
>> >         4. maximal reduction of indexed profile data size
>> >
>> >    Cons:
>> >         1. -fcoverage-mapping use case is not handled -- the size
>> > problem
>> > still exist
>> >         2. profile dump with llvm-profdata no longer have function names
>> > associated -- user needs to use postprocessing tool to get the
>> > functionality
>> >         3. function filtering with partial function name is not
>> > supported
>> >         4. Requires incompatible format change
>> >
>> >
>> > Solution-2: (http://reviews.llvm.org/D12715)
>> >
>> > In this solution, the MD5 hash string is used to replace the raw name
>> > string
>> >       Pros:
>> >           1. Very simple to implement
>> >           2. have good reduction of all sizes for typical large C++
>> > applications
>> >           3. No profile data format change is required.
>> >
>> >       Cons:
>> >           1. Still requires 16byte overhead per-function -- can actually
>> > hurt C programs
>> >           2. -fcoverage-mapping use case is still not handled
>> >           3. The problem with llvm-profdata still exists (no symbolic
>> > info,
>> > partial filtering support)
>> >
>> >
>> > Solution-3:
>> >
>> > This is the new solution I am proposing. It is basically an enhancement
>> > of
>> > Solution-1 with most of the weakness resolved.  The difference with
>> > Solution-1 is
>> >       1. Function name symbols are emitted into the symbol table as weak
>> > externs. They don't occupy any space at runtime and can be easily
>> > stripped.
>> >       2. -fcoverage-mapping does not need special handling -- it
>> > automatically benefit from the same size saving.
>> >       3. llvm-cov is changed to read symbol info from the symtab instead
>> > of
>> > reading them from the name section data
>> >       4. llvm-profdata is enhanced to take a binary as input and dump
>> > profile with names attached. Function filtering is fully supported
>> > (option
>> > can also be introduced to force dumping names into binary and profile
>> > data,
>> > so that llvm-profdata use case is not changed at all).
>> >
>> > Pros:
>> >        1. All the pros from Solution-1
>> >        2. Size savings for coverage-mapping case
>> > Cons:
>> >        Format change is required for profile data and coverage mapping.
>> >
>> > The initial patch is here: http://reviews.llvm.org/D13251
>> >
>> > With this patch, the size of a release clang binary with coverage
>> > mapping is
>> > reduced from 986M to 569M.
>> >
>> > If there are no major concerns,  I will carve out the patch into smaller
>> > ones for review.
>> >
>> > thanks,
>> >
>> > David
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Sep 8, 2015 at 3:47 PM, Xinliang David Li <davidxl at google.com>
>> > wrote:
>> >>
>> >> >> >>
>> >> >> >> yes -- it is fixed length (8byte) blob which may include null
>> >> >> >> byte
>> >> >> >> in
>> >> >> >> the middle.
>> >> >> >
>> >> >> >
>> >> >> > For reference, MD5 sum is 16 bytes (128-bit):
>> >> >> > https://en.wikipedia.org/wiki/MD5
>> >> >>
>> >> >> yes, LLVM's MD5 hash only takes the lower 64bit.
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> > Or to say it another way, suppose that Itanium mangling
>> >> >> >> > required
>> >> >> >> > as a
>> >> >> >> > final
>> >> >> >> > step to replace the string with its md5 sum in hex. Therefore
>> >> >> >> > all
>> >> >> >> > symbol
>> >> >> >> > names are "small". My understanding is that this is effectively
>> >> >> >> > all
>> >> >> >> > your
>> >> >> >> > patch is doing.
>> >> >> >>
>> >> >> >> The key type before the change is StringRef, while the after the
>> >> >> >> key
>> >> >> >> type is uint64_t. Are you suggesting treating uint64_t md5 sum
>> >> >> >> key
>> >> >> >> as
>> >> >> >> a string of 8 bytes or storing md5 has in text form which will
>> >> >> >> double
>> >> >> >> the size?
>> >> >> >
>> >> >> >
>> >> >> > How much does this change the benefit? If most of the benefit is
>> >> >> > avoiding
>> >> >> > extraordinarily long mangled names then it may be sufficient.
>> >> >> >
>> >> >> > With IR-level instrumentation like Rong is pursuing the size may
>> >> >> > be
>> >> >> > reduced
>> >> >> > sufficiently that we do not need the optimization proposed in this
>> >> >> > thread.
>> >> >> > For example, Rong found >2x size reduction on Google's C++
>> >> >> > benchmarks,
>> >> >> > which
>> >> >> > I assume are representative of the extremely large Google binaries
>> >> >> > that
>> >> >> > are
>> >> >> > causing the problems addressed by your proposal in this thread.
>> >> >> > The
>> >> >> > measurements you mention for Clang in this thread provide similar
>> >> >> > size
>> >> >> > reductions, so Rong's approach may be sufficient (especially
>> >> >> > because
>> >> >> > functions with extremely large mangled names tend to be small
>> >> >> > inline
>> >> >> > functions in header-only template libraries).
>> >> >>
>> >> >> Late instrumentation helps many cases. In some cases (as shown in
>> >> >> SPEC), the reduction in size is not as large. Reducing PGO overhead
>> >> >> will lower the bar for its adoption.
>> >> >>
>> >> >> >
>> >> >> > Of the points you mention in "Large size of overhead can limit the
>> >> >> > usability
>> >> >> > of PGO greatly", many of the issues are hard limits that prevent
>> >> >> > the
>> >> >> > use
>> >> >> > of
>> >> >> > PGO. Do you have a lower bound on how much the size of the PGO
>> >> >> > data
>> >> >> > must
>> >> >> > be
>> >> >> > reduced in order to overcome the hard limits?
>> >> >>
>> >> >> This is a static view:  Think about the situation where application
>> >> >> size is ever increasing; also think about situation where we want to
>> >> >>
>> >> >> collect more types of profile data. Think about situation where user
>> >> >> want to run pgo binaries on small devices with tiny memory/storage
>> >> >> ..
>> >> >
>> >> >
>> >> > If we want to reduce memory overhead at runtime and reduce the size
>> >> > of
>> >> > the
>> >> > raw profile data extracted from the target, there are clear
>> >> > solutions.
>> >> > Consider that debug info does not need to be loaded into the memory
>> >> > image of
>> >> > the target; why should information identifying each counter need to
>> >> > be?
>> >> > A file containing raw profile counters is a subset of a core dump; in
>> >> > most
>> >> > environments, a core dump does not need to have debug info or symbol
>> >> > names
>> >> > in it, but can be still be read in full detail in conjunction with
>> >> > the
>> >> > original binary.
>> >>
>> >> Yes -- there are many alternatives:
>> >> 1) emit the name key mapping as a side data at compile time, or
>> >> 2) emit them into nonloadable sections of the object file.
>> >>
>> >> Compared with the above, LLVM's existing design does have its own
>> >> advantage -- making it easier for tool to access 'debug' info for
>> >> counters.
>> >>
>> >> LLVM's coverage testing, on the other hand, take a hybrid approach: It
>> >> emits the coverage map as rodata, but does not pass it to the profile
>> >> dumper. I think it is better to emit covmap as a side data not
>> >> attached to target binary.
>> >>
>> >>
>> >> >
>> >> > Thus, as we require that the binary be passed to llvm-profdata, there
>> >> > is
>> >> > no
>> >> > fundamental reason that the memory image of the program, or the raw
>> >> > data
>> >> > extracted from the program, must have any size overhead besides the
>> >> > raw
>> >> > values of the counters themselves and any text size increase for
>> >> > incrementing them. If we are willing to impose this requirement on
>> >> > users,
>> >> > then as far as reducing memory overhead at runtime and reducing the
>> >> > size
>> >> > of
>> >> > the raw profile data extracted from the target, using hashed function
>> >> > names
>> >> > is clearly the wrong direction.
>> >> >
>> >> > *Without* imposing the requirement of passing the binary to
>> >> > llvm-profdata, I
>> >> > do like the ability to use hashed function names like you are
>> >> > proposing.
>> >> > It
>> >> > is a simple solution for reducing size overhead of function name
>> >> > strings
>> >> > with little complexity, as it is just swapping one string for
>> >> > another.
>> >>
>> >> Agree. The good news is that the overhead of hashed function names is
>> >> small enough that makes this approach attractive.
>> >>
>> >> thanks,
>> >>
>> >> David
>> >> >
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > Obviously LLVM must be able to support the extremely large
>> >> >> > binaries
>> >> >> > in
>> >> >> > your
>> >> >> > configuration (otherwise what use is LLVM as a compiler ;) My
>> >> >> > questions
>> >> >> > are
>> >> >> > primarily aimed at establishing which tradeoffs are acceptable for
>> >> >> > supporting this (both for LLVM and for you guys).
>> >> >>
>> >> >> As I said, with the modified proposal (after getting your feedback),
>> >> >> no PGO users in LLVM land is going to lose anything/functionality.
>> >> >> The
>> >> >> end result will be net win for general users of LLVM (even though
>> >> >> your
>> >> >> customers don't care about it), not just 'us' as you have mentioned
>> >> >> many times.
>> >> >>
>> >> >> >
>> >> >> > Btw, for us, the issue of PGO data size is not completely
>> >> >> > immaterial
>> >> >> > but
>> >> >> > is
>> >> >> > very different from your use case. For us, the primary issue is
>> >> >> > the
>> >> >> > additional memory use at run time, since PS4 games usually use
>> >> >> > "all"
>> >> >> > available memory. We had a problem with UBSan where the large
>> >> >> > amount
>> >> >> > of
>> >> >> > memory required for storing the UBSan diagnostic data at runtime
>> >> >> > required
>> >> >> > the game programmers to manually change their memory map to make
>> >> >> > room.
>> >> >> > +Filipe, do you remember how much memory UBSan was using that
>> >> >> > caused
>> >> >> > a
>> >> >> > problem?
>> >> >> >
>> >> >>
>> >> >> My proposal does help reducing rodata size significantly.
>> >> >
>> >> >
>> >> > Yes, that is why I think that this is a useful thing to do. I just
>> >> > want
>> >> > to
>> >> > be careful about existing use cases and the relevant workflow issues.
>> >> >
>> >> > -- Sean Silva
>> >> >
>> >> >>
>> >> >>
>> >> >> David
>> >> >>
>> >> >>
>> >> >> > -- Sean Silva
>> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >> In the raw format, md5 sum key can be an embedded field in the
>> >> >> >> prf_data variable instead of as different var referenced by
>> >> >> >> prf_data.
>> >> >> >>
>> >> >> >> >
>> >> >> >> > If this is not the case, you should show your current patch so
>> >> >> >> > that
>> >> >> >> > we
>> >> >> >> > can
>> >> >> >> > discuss things concretely.
>> >> >> >>
>> >> >> >> It is not. See above about the difference.
>> >> >> >>
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> It will be
>> >> >> >> >> >> very messy to support multiple formats in instr-codegen and
>> >> >> >> >> >> instr-runtime.  For compatibility concerns, the reader is
>> >> >> >> >> >> taught
>> >> >> >> >> >> to
>> >> >> >> >> >> support previous format, but the changes there are isolated
>> >> >> >> >> >> (also
>> >> >> >> >> >> expected to be removed in the future).
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> > My primary concern is that if the function name are not
>> >> >> >> >> >> > kept
>> >> >> >> >> >> > at
>> >> >> >> >> >> > all
>> >> >> >> >> >> > stages,
>> >> >> >> >> >> > then it becomes difficult to analyze the profile data in
>> >> >> >> >> >> > a
>> >> >> >> >> >> > standalone
>> >> >> >> >> >> > way.
>> >> >> >> >> >> > Many times, I have used `llvm-profdata show
>> >> >> >> >> >> > -all-functions
>> >> >> >> >> >> > foo.profdata`
>> >> >> >> >> >> > on
>> >> >> >> >> >> > the resulting profile data and then imported that data
>> >> >> >> >> >> > into
>> >> >> >> >> >> > Mathematica
>> >> >> >> >> >> > for
>> >> >> >> >> >> > analysis.
>> >> >> >> >> >>
>> >> >> >> >> >> This is certainly a very valid use case.
>> >> >> >> >> >>
>> >> >> >> >> >> >My understanding of your proposal is that `llvm-profdata
>> >> >> >> >> >> > show
>> >> >> >> >> >> > -all-functions foo.profdata` will not show the actual
>> >> >> >> >> >> > function
>> >> >> >> >> >> > names
>> >> >> >> >> >> > but
>> >> >> >> >> >> > instead MD5 hashes,
>> >> >> >> >> >>
>> >> >> >> >> >> Yes.
>> >> >> >> >> >>
>> >> >> >> >> >> To support your use case, there are two solutions:
>> >> >> >> >> >>
>> >> >> >> >> >> 1) user can add -fcoverage-mapping option in the build
>> >> >> >> >> >> 2) introduce a new option -fprofile-instr-names that force
>> >> >> >> >> >> the
>> >> >> >> >> >> emission of the name sections in the .o file. This is
>> >> >> >> >> >> similar
>> >> >> >> >> >> to
>> >> >> >> >> >> 1),
>> >> >> >> >> >> but no covmap section is needed.
>> >> >> >> >> >>
>> >> >> >> >> >> llvm-profdata tool  will be taught to read the name section
>> >> >> >> >> >> and
>> >> >> >> >> >> attach
>> >> >> >> >> >> function names to the profile records.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > Needing to pass the executable to llvm-profdata would cause
>> >> >> >> >> > deployment
>> >> >> >> >> > issues for my customers in practice.
>> >> >> >> >>
>> >> >> >> >> Why? The deployment needs to pass the profile data anyway
>> >> >> >> >> right?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Yes, but not the executable.
>> >> >> >> >
>> >> >> >> > The PGO training run is likely being run by a gameplay tester
>> >> >> >> > (non-programmer). In general the binary will not be lying
>> >> >> >> > around
>> >> >> >> > as a
>> >> >> >> > loose
>> >> >> >> > file anywhere, it will be part of a full package of the
>> >> >> >> > binary+assets
>> >> >> >> > (think
>> >> >> >> > like what will end up on a bluray disc). A game's binary
>> >> >> >> > *completely
>> >> >> >> > useless* without the assets, so except locally on a
>> >> >> >> > programmer's
>> >> >> >> > machine
>> >> >> >> > while they iterate/debug, there is no reason for a binary to
>> >> >> >> > ever
>> >> >> >> > exist
>> >> >> >> > as a
>> >> >> >> > standalone file.
>> >> >> >> >
>> >> >> >> > I'm not saying that needing the binary is insurmountable in any
>> >> >> >> > particular
>> >> >> >> > scenario. Just that it will cause a strict increase in the
>> >> >> >> > number
>> >> >> >> > of
>> >> >> >> > issues
>> >> >> >> > to deploying PGO.
>> >> >> >>
>> >> >> >>  Your concern is acknowledged.
>> >> >> >>
>> >> >> >> >
>> >> >> >> > These are much bigger "compatibility concerns" for me than for
>> >> >> >> > newer
>> >> >> >> > toolchains to accept the old format. For a change in format I
>> >> >> >> > can
>> >> >> >> > easily
>> >> >> >> > tell my users to replace an exe with a newer one and that is
>> >> >> >> > all
>> >> >> >> > they
>> >> >> >> > need
>> >> >> >> > to do and it takes 10 seconds, guaranteed. A workflow change is
>> >> >> >> > potentially
>> >> >> >> > a massive disruption and guaranteed to take more than 10
>> >> >> >> > seconds
>> >> >> >> > to
>> >> >> >> > fix
>> >> >> >> > (perhaps hours or days).
>> >> >> >>
>> >> >> >> ok.
>> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>   This
>> >> >> >> >> is no different from llvm-cov usage model.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > In practice, getting the performance of PGO is a higher
>> >> >> >> > priority
>> >> >> >> > for
>> >> >> >> > my
>> >> >> >> > users, so we should not assume that llvm-cov is being used.
>> >> >> >>
>> >> >> >> Glad to hear that :)
>> >> >> >>
>> >> >> >> thanks,
>> >> >> >>
>> >> >> >> David
>> >> >> >>
>> >> >> >> >
>> >> >> >> > -- Sean Silva
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> David
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> Note that with 1) or 2), the user can still benefit from
>> >> >> >> >> >> the
>> >> >> >> >> >> reduced
>> >> >> >> >> >> profile size.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > Let me reiterate that the size of the profile is not a
>> >> >> >> >> > problem
>> >> >> >> >> > I
>> >> >> >> >> > have
>> >> >> >> >> > observed in practice (nor have I heard of this being a
>> >> >> >> >> > problem
>> >> >> >> >> > in
>> >> >> >> >> > practice
>> >> >> >> >> > until this thread). Therefore I'm skeptical of any changes
>> >> >> >> >> > to
>> >> >> >> >> > our
>> >> >> >> >> > default
>> >> >> >> >> > behavior or any new requirements that are not opt-in.
>> >> >> >> >> >
>> >> >> >> >> > -- Sean Silva
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> thanks,
>> >> >> >> >> >>
>> >> >> >> >> >> David
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> >which will make it more difficult for me to do this kind
>> >> >> >> >> >> > of analysis (would require using nm on the original
>> >> >> >> >> >> > binary,
>> >> >> >> >> >> > hashing
>> >> >> >> >> >> > everything, etc.).
>> >> >> >> >> >> >
>> >> >> >> >> >> > btw, feel free to attach the patch even if it in a rough
>> >> >> >> >> >> > state.
>> >> >> >> >> >> > It
>> >> >> >> >> >> > can
>> >> >> >> >> >> > still
>> >> >> >> >> >> > help to clarify the proposal and be a good talking point.
>> >> >> >> >> >> > Fine-grained
>> >> >> >> >> >> > patch
>> >> >> >> >> >> > review for caring about the rough parts will happen on
>> >> >> >> >> >> > llvm-commits;
>> >> >> >> >> >> > the
>> >> >> >> >> >> > rough parts will not distract the discussion here on
>> >> >> >> >> >> > llvm-dev.
>> >> >> >> >> >> >
>> >> >> >> >> >> > -- Sean Silva
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> thanks,
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> David
>> >> >> >> >> >> >> _______________________________________________
>> >> >> >> >> >> >> LLVM Developers mailing list
>> >> >> >> >> >> >> llvm-dev at lists.llvm.org
>> >> >> >> >> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>