[LLVMdev] RFC - Improvements to PGO profile support

Thu Feb 26 13:40:04 PST 2015

On Thu, Feb 26, 2015 at 12:50 PM, Bob Wilson <bob.wilson at apple.com> wrote:
>
> On Feb 24, 2015, at 3:31 PM, Diego Novillo <dnovillo at google.com> wrote:
>
>
> We (Google) have started to look more closely at the profiling
> infrastructure in LLVM. Internally, we have a large dependency on PGO to get
> peak performance in generated code.
>
> Some of the dependencies we have on profiling are still not present in LLVM
> (e.g., the inliner) but we will still need to incorporate changes to support
> our work on these optimizations. Some of the changes may be addressed as
> individual bug fixes on the existing profiling infrastructure. Other changes
> may be better implemented as either new extensions or as replacements of
> existing code.
>
> I think we will try to minimize infrastructure replacement at least in the
> short/medium term. After all, it doesn't make too much sense to replace
> infrastructure that is broken for code that doesn't exist yet.
>
> David Li and I are preparing a document where we describe the major issues
> that we'd like to address. The document is a bit on the lengthy side, so it
> may be easier to start with an email discussion. This is a summary of the
> main changes we are looking at:
>
> Need to faithfully represent the execution count taken from dynamic
> profiles. Currently, MD_prof does not really represent an execution count.
> This makes things like comparing hotness across functions hard or
> impossible. We need a concept of global hotness.
>
>
> The plan that we have discussed in the past (I don’t remember when) was to
> record simple function entry execution counts. Those could be combined with
> the  BlockFrequencyInfo to compare “hotness” across functions.

Yes -- there are two aspects of the problems.
1) raw profile data representation in IR and
2) the profile count data represented for CFG.

What you said is for 2) which is one of the possibilities. There is a
third issue that is also going to be covered in more detail -- that is
the Block Frequency propagation algorithm is limited (leading to
information loss). When profile count is available, block frequency
data can be directly computed via simple normalization and scaling.
This requires the raw edge count data to be represented in 1)
truthfully.

>
> When the CFG or callgraph change, there need to exist an API for
> incrementally updating/scaling counts. For instance, when a function is
> inlined or partially inlined, when the CFG is modified, etc. These counts
> need to be updated incrementally (or perhaps re-computed as a first step
> into that direction).
>
> One of the main reasons that we chose to use branch weights to represent
> profile information within functions is that it makes this problem easier.
> Of course, we still need to update the branch weights when transforming the
> CFG, but I believe most of that work has already been done. Are you
> suggesting that we should work on incremental BlockFrequencyInfo updates? We
> have discussed that in the past, but so far, it has worked reasonably well
> to just re-run that analysis. (I wouldn’t be surprised if we’re missing some
> places where the analysis needs to be invalidated so that it gets re-run.)

Diego is going to share the proposal in more detail. I will give a
brief summary to answer your question:

1) making raw profile data (in MD_prof) to truthfully does not change
its original meaning -- the branch count is still branch weight -- but
it happens to be also execution count which contains more information.
2) At CFG level (BranchProbabilityInfo), using real branch probability
does not change the way branch weight is used either -- the branch
probability is still branch weight (but normalized to be 0 <= prob
<=1). Benefits and reasons behind that will be provided.

The infrastructure is ready to update the raw MD_prof data during
intra-procedural transformations when branch instructions are cloned,
but not CFG level branch prob and block-frequency/block count update.
This is especially important for inter-procedural transformations like
inliner (i.e.,  update the profile data associated with inline
instance in the caller context). Recomputing via freq propagation is
not only not precise (as mentioned above), but also very compile time
consuming.

thanks,

David

>
> The inliner (and other optimizations) needs to use profile information and
> update it accordingly. This is predicated on Chandler's work on the pass
> manager, of course.
> Need to represent global profile summary data. For example, for global
> hotness determination, it is useful to compute additional global summary
> info, such as a histogram of counts that can be used to determine hotness
> and working set size estimates for a large percentage of the profiled
> execution.
>
> There are other changes that we will need to incorporate. David, Teresa,
> Chandler, please add anything large that I missed.
>
> My main question at the moment is what would be the best way of addressing
> them. Some seem to require new concepts to be implemented (e.g., execution
> counts). Others could be addressed as simple bugs to be fixed in the current
> framework.
>
> Would it make sense to present everything in a unified document and discuss
> that? I've got some reservations about that approach because we will end up
> discussing everything at once and it may not lead to concrete progress.
> Another approach would be to present each issue individually either as
> patches or RFCs or bugs.
>
> I will be taking on the implementation of several of these issues. Some of
> them involve the SamplePGO harness that I added last year. I would also like
> to know what other bugs or problems people have in mind that I could also
> roll into this work.
>
>
> Thanks. Diego.
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>