[llvm-dev] RFC: Reducing Instr PGO size overhead

Wed Oct 14 00:30:19 PDT 2015

I plan to divide the patch into series of small patches. The
preparation patches will mostly be refactoring changes with NFC. This
will minimize the size of final patch(es) with functional changes to
help discussions.

thanks,

David

On Fri, Oct 9, 2015 at 4:06 PM, Xinliang David Li <davidxl at google.com> wrote:
> On Fri, Oct 9, 2015 at 3:58 PM, Sean Silva <chisophugis at gmail.com> wrote:
>>
>>
>> On Wed, Oct 7, 2015 at 11:12 PM, Xinliang David Li <davidxl at google.com>
>> wrote:
>>>
>>> There is no further response to this, so I will assume general
>>> direction of solution-3 is acceptable ;)
>
>> No response does not mean "LGTM".
>>
>
> What I meant is that the discussion can be moved on to the formal code
> review. I have not yet submitted the final patch for review yet.
> Before that is ready, continue using this thread to voice your
> concerns.
>
> David
>
>
>
>
>> -- Sean Silva
>>
>>>
>>>
>>> Solution-3 can be further improved. Instead of using static symbol
>>> table (with zero size __llvm_prf_nm symbols) to store function names
>>> for profile display and coverage mapping, the function names can be
>>> stored compressed in a non-allocatable section. The compression ratio
>>> for function name strings can be very high (~5x).  The covmapping data
>>> can also be made non-allocatable.
>>>
>>> thanks,
>>>
>>> David
>>>
>>> On Tue, Sep 29, 2015 at 9:52 AM, Xinliang David Li <davidxl at google.com>
>>> wrote:
>>> > Sorry for the late update. I finally found time to implement a solution
>>> > (Solution-3) that has the best size savings (for both PGO and coverage
>>> > testing) with symbolic information available.  Here is a re-cap of what
>>> > we
>>> > have discussed so far:
>>> >
>>> > Solution-1:
>>> >
>>> > This is the original solution proposed. In this solution, A function
>>> > name's
>>> > MD5 hash is used as the primary key (combined with IR hash) for function
>>> > lookup. __llvm_prf_names section won't be emitted into the binary nor
>>> > dumped
>>> > into the profile data unless -fcoverage-mapping is also specified.
>>> >
>>> >     Pros:
>>> >         1. maximal reduction of instrumentation binary process image
>>> > size
>>> >         2. maximal reduction of object and unstripped binary size
>>> >         3. maximal reduction of raw profile data size
>>> >         4. maximal reduction of indexed profile data size
>>> >
>>> >    Cons:
>>> >         1. -fcoverage-mapping use case is not handled -- the size
>>> > problem
>>> > still exist
>>> >         2. profile dump with llvm-profdata no longer have function names
>>> > associated -- user needs to use postprocessing tool to get the
>>> > functionality
>>> >         3. function filtering with partial function name is not
>>> > supported
>>> >         4. Requires incompatible format change
>>> >
>>> >
>>> > Solution-2: (http://reviews.llvm.org/D12715)
>>> >
>>> > In this solution, the MD5 hash string is used to replace the raw name
>>> > string
>>> >       Pros:
>>> >           1. Very simple to implement
>>> >           2. have good reduction of all sizes for typical large C++
>>> > applications
>>> >           3. No profile data format change is required.
>>> >
>>> >       Cons:
>>> >           1. Still requires 16byte overhead per-function -- can actually
>>> > hurt C programs
>>> >           2. -fcoverage-mapping use case is still not handled
>>> >           3. The problem with llvm-profdata still exists (no symbolic
>>> > info,
>>> > partial filtering support)
>>> >
>>> >
>>> > Solution-3:
>>> >
>>> > This is the new solution I am proposing. It is basically an enhancement
>>> > of
>>> > Solution-1 with most of the weakness resolved.  The difference with
>>> > Solution-1 is
>>> >       1. Function name symbols are emitted into the symbol table as weak
>>> > externs. They don't occupy any space at runtime and can be easily
>>> > stripped.
>>> >       2. -fcoverage-mapping does not need special handling -- it
>>> > automatically benefit from the same size saving.
>>> >       3. llvm-cov is changed to read symbol info from the symtab instead
>>> > of
>>> > reading them from the name section data
>>> >       4. llvm-profdata is enhanced to take a binary as input and dump
>>> > profile with names attached. Function filtering is fully supported
>>> > (option
>>> > can also be introduced to force dumping names into binary and profile
>>> > data,
>>> > so that llvm-profdata use case is not changed at all).
>>> >
>>> > Pros:
>>> >        1. All the pros from Solution-1
>>> >        2. Size savings for coverage-mapping case
>>> > Cons:
>>> >        Format change is required for profile data and coverage mapping.
>>> >
>>> > The initial patch is here: http://reviews.llvm.org/D13251
>>> >
>>> > With this patch, the size of a release clang binary with coverage
>>> > mapping is
>>> > reduced from 986M to 569M.
>>> >
>>> > If there are no major concerns,  I will carve out the patch into smaller
>>> > ones for review.
>>> >
>>> > thanks,
>>> >
>>> > David
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, Sep 8, 2015 at 3:47 PM, Xinliang David Li <davidxl at google.com>
>>> > wrote:
>>> >>
>>> >> >> >>
>>> >> >> >> yes -- it is fixed length (8byte) blob which may include null
>>> >> >> >> byte
>>> >> >> >> in
>>> >> >> >> the middle.
>>> >> >> >
>>> >> >> >
>>> >> >> > For reference, MD5 sum is 16 bytes (128-bit):
>>> >> >> > https://en.wikipedia.org/wiki/MD5
>>> >> >>
>>> >> >> yes, LLVM's MD5 hash only takes the lower 64bit.
>>> >> >>
>>> >> >>
>>> >> >> >
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> >
>>> >> >> >> > Or to say it another way, suppose that Itanium mangling
>>> >> >> >> > required
>>> >> >> >> > as a
>>> >> >> >> > final
>>> >> >> >> > step to replace the string with its md5 sum in hex. Therefore
>>> >> >> >> > all
>>> >> >> >> > symbol
>>> >> >> >> > names are "small". My understanding is that this is effectively
>>> >> >> >> > all
>>> >> >> >> > your
>>> >> >> >> > patch is doing.
>>> >> >> >>
>>> >> >> >> The key type before the change is StringRef, while the after the
>>> >> >> >> key
>>> >> >> >> type is uint64_t. Are you suggesting treating uint64_t md5 sum
>>> >> >> >> key
>>> >> >> >> as
>>> >> >> >> a string of 8 bytes or storing md5 has in text form which will
>>> >> >> >> double
>>> >> >> >> the size?
>>> >> >> >
>>> >> >> >
>>> >> >> > How much does this change the benefit? If most of the benefit is
>>> >> >> > avoiding
>>> >> >> > extraordinarily long mangled names then it may be sufficient.
>>> >> >> >
>>> >> >> > With IR-level instrumentation like Rong is pursuing the size may
>>> >> >> > be
>>> >> >> > reduced
>>> >> >> > sufficiently that we do not need the optimization proposed in this
>>> >> >> > thread.
>>> >> >> > For example, Rong found >2x size reduction on Google's C++
>>> >> >> > benchmarks,
>>> >> >> > which
>>> >> >> > I assume are representative of the extremely large Google binaries
>>> >> >> > that
>>> >> >> > are
>>> >> >> > causing the problems addressed by your proposal in this thread.
>>> >> >> > The
>>> >> >> > measurements you mention for Clang in this thread provide similar
>>> >> >> > size
>>> >> >> > reductions, so Rong's approach may be sufficient (especially
>>> >> >> > because
>>> >> >> > functions with extremely large mangled names tend to be small
>>> >> >> > inline
>>> >> >> > functions in header-only template libraries).
>>> >> >>
>>> >> >> Late instrumentation helps many cases. In some cases (as shown in
>>> >> >> SPEC), the reduction in size is not as large. Reducing PGO overhead
>>> >> >> will lower the bar for its adoption.
>>> >> >>
>>> >> >> >
>>> >> >> > Of the points you mention in "Large size of overhead can limit the
>>> >> >> > usability
>>> >> >> > of PGO greatly", many of the issues are hard limits that prevent
>>> >> >> > the
>>> >> >> > use
>>> >> >> > of
>>> >> >> > PGO. Do you have a lower bound on how much the size of the PGO
>>> >> >> > data
>>> >> >> > must
>>> >> >> > be
>>> >> >> > reduced in order to overcome the hard limits?
>>> >> >>
>>> >> >> This is a static view:  Think about the situation where application
>>> >> >> size is ever increasing; also think about situation where we want to
>>> >> >>
>>> >> >> collect more types of profile data. Think about situation where user
>>> >> >> want to run pgo binaries on small devices with tiny memory/storage
>>> >> >> ..
>>> >> >
>>> >> >
>>> >> > If we want to reduce memory overhead at runtime and reduce the size
>>> >> > of
>>> >> > the
>>> >> > raw profile data extracted from the target, there are clear
>>> >> > solutions.
>>> >> > Consider that debug info does not need to be loaded into the memory
>>> >> > image of
>>> >> > the target; why should information identifying each counter need to
>>> >> > be?
>>> >> > A file containing raw profile counters is a subset of a core dump; in
>>> >> > most
>>> >> > environments, a core dump does not need to have debug info or symbol
>>> >> > names
>>> >> > in it, but can be still be read in full detail in conjunction with
>>> >> > the
>>> >> > original binary.
>>> >>
>>> >> Yes -- there are many alternatives:
>>> >> 1) emit the name key mapping as a side data at compile time, or
>>> >> 2) emit them into nonloadable sections of the object file.
>>> >>
>>> >> Compared with the above, LLVM's existing design does have its own
>>> >> advantage -- making it easier for tool to access 'debug' info for
>>> >> counters.
>>> >>
>>> >> LLVM's coverage testing, on the other hand, take a hybrid approach: It
>>> >> emits the coverage map as rodata, but does not pass it to the profile
>>> >> dumper. I think it is better to emit covmap as a side data not
>>> >> attached to target binary.
>>> >>
>>> >>
>>> >> >
>>> >> > Thus, as we require that the binary be passed to llvm-profdata, there
>>> >> > is
>>> >> > no
>>> >> > fundamental reason that the memory image of the program, or the raw
>>> >> > data
>>> >> > extracted from the program, must have any size overhead besides the
>>> >> > raw
>>> >> > values of the counters themselves and any text size increase for
>>> >> > incrementing them. If we are willing to impose this requirement on
>>> >> > users,
>>> >> > then as far as reducing memory overhead at runtime and reducing the
>>> >> > size
>>> >> > of
>>> >> > the raw profile data extracted from the target, using hashed function
>>> >> > names
>>> >> > is clearly the wrong direction.
>>> >> >
>>> >> > *Without* imposing the requirement of passing the binary to
>>> >> > llvm-profdata, I
>>> >> > do like the ability to use hashed function names like you are
>>> >> > proposing.
>>> >> > It
>>> >> > is a simple solution for reducing size overhead of function name
>>> >> > strings
>>> >> > with little complexity, as it is just swapping one string for
>>> >> > another.
>>> >>
>>> >> Agree. The good news is that the overhead of hashed function names is
>>> >> small enough that makes this approach attractive.
>>> >>
>>> >> thanks,
>>> >>
>>> >> David
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> >
>>> >> >> > Obviously LLVM must be able to support the extremely large
>>> >> >> > binaries
>>> >> >> > in
>>> >> >> > your
>>> >> >> > configuration (otherwise what use is LLVM as a compiler ;) My
>>> >> >> > questions
>>> >> >> > are
>>> >> >> > primarily aimed at establishing which tradeoffs are acceptable for
>>> >> >> > supporting this (both for LLVM and for you guys).
>>> >> >>
>>> >> >> As I said, with the modified proposal (after getting your feedback),
>>> >> >> no PGO users in LLVM land is going to lose anything/functionality.
>>> >> >> The
>>> >> >> end result will be net win for general users of LLVM (even though
>>> >> >> your
>>> >> >> customers don't care about it), not just 'us' as you have mentioned
>>> >> >> many times.
>>> >> >>
>>> >> >> >
>>> >> >> > Btw, for us, the issue of PGO data size is not completely
>>> >> >> > immaterial
>>> >> >> > but
>>> >> >> > is
>>> >> >> > very different from your use case. For us, the primary issue is
>>> >> >> > the
>>> >> >> > additional memory use at run time, since PS4 games usually use
>>> >> >> > "all"
>>> >> >> > available memory. We had a problem with UBSan where the large
>>> >> >> > amount
>>> >> >> > of
>>> >> >> > memory required for storing the UBSan diagnostic data at runtime
>>> >> >> > required
>>> >> >> > the game programmers to manually change their memory map to make
>>> >> >> > room.
>>> >> >> > +Filipe, do you remember how much memory UBSan was using that
>>> >> >> > caused
>>> >> >> > a
>>> >> >> > problem?
>>> >> >> >
>>> >> >>
>>> >> >> My proposal does help reducing rodata size significantly.
>>> >> >
>>> >> >
>>> >> > Yes, that is why I think that this is a useful thing to do. I just
>>> >> > want
>>> >> > to
>>> >> > be careful about existing use cases and the relevant workflow issues.
>>> >> >
>>> >> > -- Sean Silva
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> David
>>> >> >>
>>> >> >>
>>> >> >> > -- Sean Silva
>>> >> >> >
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> In the raw format, md5 sum key can be an embedded field in the
>>> >> >> >> prf_data variable instead of as different var referenced by
>>> >> >> >> prf_data.
>>> >> >> >>
>>> >> >> >> >
>>> >> >> >> > If this is not the case, you should show your current patch so
>>> >> >> >> > that
>>> >> >> >> > we
>>> >> >> >> > can
>>> >> >> >> > discuss things concretely.
>>> >> >> >>
>>> >> >> >> It is not. See above about the difference.
>>> >> >> >>
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> >>
>>> >> >> >> >> >> It will be
>>> >> >> >> >> >> very messy to support multiple formats in instr-codegen and
>>> >> >> >> >> >> instr-runtime.  For compatibility concerns, the reader is
>>> >> >> >> >> >> taught
>>> >> >> >> >> >> to
>>> >> >> >> >> >> support previous format, but the changes there are isolated
>>> >> >> >> >> >> (also
>>> >> >> >> >> >> expected to be removed in the future).
>>> >> >> >> >> >>
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > My primary concern is that if the function name are not
>>> >> >> >> >> >> > kept
>>> >> >> >> >> >> > at
>>> >> >> >> >> >> > all
>>> >> >> >> >> >> > stages,
>>> >> >> >> >> >> > then it becomes difficult to analyze the profile data in
>>> >> >> >> >> >> > a
>>> >> >> >> >> >> > standalone
>>> >> >> >> >> >> > way.
>>> >> >> >> >> >> > Many times, I have used `llvm-profdata show
>>> >> >> >> >> >> > -all-functions
>>> >> >> >> >> >> > foo.profdata`
>>> >> >> >> >> >> > on
>>> >> >> >> >> >> > the resulting profile data and then imported that data
>>> >> >> >> >> >> > into
>>> >> >> >> >> >> > Mathematica
>>> >> >> >> >> >> > for
>>> >> >> >> >> >> > analysis.
>>> >> >> >> >> >>
>>> >> >> >> >> >> This is certainly a very valid use case.
>>> >> >> >> >> >>
>>> >> >> >> >> >> >My understanding of your proposal is that `llvm-profdata
>>> >> >> >> >> >> > show
>>> >> >> >> >> >> > -all-functions foo.profdata` will not show the actual
>>> >> >> >> >> >> > function
>>> >> >> >> >> >> > names
>>> >> >> >> >> >> > but
>>> >> >> >> >> >> > instead MD5 hashes,
>>> >> >> >> >> >>
>>> >> >> >> >> >> Yes.
>>> >> >> >> >> >>
>>> >> >> >> >> >> To support your use case, there are two solutions:
>>> >> >> >> >> >>
>>> >> >> >> >> >> 1) user can add -fcoverage-mapping option in the build
>>> >> >> >> >> >> 2) introduce a new option -fprofile-instr-names that force
>>> >> >> >> >> >> the
>>> >> >> >> >> >> emission of the name sections in the .o file. This is
>>> >> >> >> >> >> similar
>>> >> >> >> >> >> to
>>> >> >> >> >> >> 1),
>>> >> >> >> >> >> but no covmap section is needed.
>>> >> >> >> >> >>
>>> >> >> >> >> >> llvm-profdata tool  will be taught to read the name section
>>> >> >> >> >> >> and
>>> >> >> >> >> >> attach
>>> >> >> >> >> >> function names to the profile records.
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> > Needing to pass the executable to llvm-profdata would cause
>>> >> >> >> >> > deployment
>>> >> >> >> >> > issues for my customers in practice.
>>> >> >> >> >>
>>> >> >> >> >> Why? The deployment needs to pass the profile data anyway
>>> >> >> >> >> right?
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > Yes, but not the executable.
>>> >> >> >> >
>>> >> >> >> > The PGO training run is likely being run by a gameplay tester
>>> >> >> >> > (non-programmer). In general the binary will not be lying
>>> >> >> >> > around
>>> >> >> >> > as a
>>> >> >> >> > loose
>>> >> >> >> > file anywhere, it will be part of a full package of the
>>> >> >> >> > binary+assets
>>> >> >> >> > (think
>>> >> >> >> > like what will end up on a bluray disc). A game's binary
>>> >> >> >> > *completely
>>> >> >> >> > useless* without the assets, so except locally on a
>>> >> >> >> > programmer's
>>> >> >> >> > machine
>>> >> >> >> > while they iterate/debug, there is no reason for a binary to
>>> >> >> >> > ever
>>> >> >> >> > exist
>>> >> >> >> > as a
>>> >> >> >> > standalone file.
>>> >> >> >> >
>>> >> >> >> > I'm not saying that needing the binary is insurmountable in any
>>> >> >> >> > particular
>>> >> >> >> > scenario. Just that it will cause a strict increase in the
>>> >> >> >> > number
>>> >> >> >> > of
>>> >> >> >> > issues
>>> >> >> >> > to deploying PGO.
>>> >> >> >>
>>> >> >> >>  Your concern is acknowledged.
>>> >> >> >>
>>> >> >> >> >
>>> >> >> >> > These are much bigger "compatibility concerns" for me than for
>>> >> >> >> > newer
>>> >> >> >> > toolchains to accept the old format. For a change in format I
>>> >> >> >> > can
>>> >> >> >> > easily
>>> >> >> >> > tell my users to replace an exe with a newer one and that is
>>> >> >> >> > all
>>> >> >> >> > they
>>> >> >> >> > need
>>> >> >> >> > to do and it takes 10 seconds, guaranteed. A workflow change is
>>> >> >> >> > potentially
>>> >> >> >> > a massive disruption and guaranteed to take more than 10
>>> >> >> >> > seconds
>>> >> >> >> > to
>>> >> >> >> > fix
>>> >> >> >> > (perhaps hours or days).
>>> >> >> >>
>>> >> >> >> ok.
>>> >> >> >>
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >>   This
>>> >> >> >> >> is no different from llvm-cov usage model.
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > In practice, getting the performance of PGO is a higher
>>> >> >> >> > priority
>>> >> >> >> > for
>>> >> >> >> > my
>>> >> >> >> > users, so we should not assume that llvm-cov is being used.
>>> >> >> >>
>>> >> >> >> Glad to hear that :)
>>> >> >> >>
>>> >> >> >> thanks,
>>> >> >> >>
>>> >> >> >> David
>>> >> >> >>
>>> >> >> >> >
>>> >> >> >> > -- Sean Silva
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> David
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> >
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> Note that with 1) or 2), the user can still benefit from
>>> >> >> >> >> >> the
>>> >> >> >> >> >> reduced
>>> >> >> >> >> >> profile size.
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> > Let me reiterate that the size of the profile is not a
>>> >> >> >> >> > problem
>>> >> >> >> >> > I
>>> >> >> >> >> > have
>>> >> >> >> >> > observed in practice (nor have I heard of this being a
>>> >> >> >> >> > problem
>>> >> >> >> >> > in
>>> >> >> >> >> > practice
>>> >> >> >> >> > until this thread). Therefore I'm skeptical of any changes
>>> >> >> >> >> > to
>>> >> >> >> >> > our
>>> >> >> >> >> > default
>>> >> >> >> >> > behavior or any new requirements that are not opt-in.
>>> >> >> >> >> >
>>> >> >> >> >> > -- Sean Silva
>>> >> >> >> >> >
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> thanks,
>>> >> >> >> >> >>
>>> >> >> >> >> >> David
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> >which will make it more difficult for me to do this kind
>>> >> >> >> >> >> > of analysis (would require using nm on the original
>>> >> >> >> >> >> > binary,
>>> >> >> >> >> >> > hashing
>>> >> >> >> >> >> > everything, etc.).
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > btw, feel free to attach the patch even if it in a rough
>>> >> >> >> >> >> > state.
>>> >> >> >> >> >> > It
>>> >> >> >> >> >> > can
>>> >> >> >> >> >> > still
>>> >> >> >> >> >> > help to clarify the proposal and be a good talking point.
>>> >> >> >> >> >> > Fine-grained
>>> >> >> >> >> >> > patch
>>> >> >> >> >> >> > review for caring about the rough parts will happen on
>>> >> >> >> >> >> > llvm-commits;
>>> >> >> >> >> >> > the
>>> >> >> >> >> >> > rough parts will not distract the discussion here on
>>> >> >> >> >> >> > llvm-dev.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > -- Sean Silva
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> thanks,
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> David
>>> >> >> >> >> >> >> _______________________________________________
>>> >> >> >> >> >> >> LLVM Developers mailing list
>>> >> >> >> >> >> >> llvm-dev at lists.llvm.org
>>> >> >> >> >> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >
>>> >> >
>>> >
>>> >
>>
>>