[LLVMdev] RFC - Improvements to PGO profile support

Xinliang David Li davidxl at google.com
Wed Feb 25 14:29:08 PST 2015


On Wed, Feb 25, 2015 at 2:14 PM, Philip Reames
<listmail at philipreames.com> wrote:
>
> On 02/25/2015 12:40 PM, Xinliang David Li wrote:
>>
>> On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames
>> <listmail at philipreames.com> wrote:
>>>
>>> On 02/24/2015 03:31 PM, Diego Novillo wrote:
>>>
>>> Need to faithfully represent the execution count taken from dynamic
>>> profiles. Currently, MD_prof does not really represent an execution
>>> count.
>>> This makes things like comparing hotness across functions hard or
>>> impossible. We need a concept of global hotness.
>>>
>>> What does MD_prof actually represent when used from Clang?  I know
>>> I've been using it for execution counters in my frontend.  Am I
>>> approaching that wrong?
>>>
>>> As a side comment: I'm a bit leery of a consistent notion of hotness
>>> based on counters across functions.  These counters are almost always
>>> approximate in practice, and counting problems run rampant.
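For context on the MD_prof question above: as currently emitted, !prof
metadata records relative branch weights, not absolute execution counts. A
branch carrying it looks roughly like this (the function and the weight
values here are made up for illustration):

```llvm
define i32 @clamp(i32 %x) {
entry:
  %cmp = icmp slt i32 %x, 0
  br i1 %cmp, label %if.then, label %if.else, !prof !0
if.then:
  ret i32 0
if.else:
  ret i32 %x
}

!0 = !{!"branch_weights", i32 4, i32 64}
```

The weights only say the false edge was observed roughly 16x more often
than the true edge within this branch; nothing ties them to a count in any
other function, which is the gap Diego's proposal is targeting.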
>>
>> Having representative training runs is a prerequisite for using FDO/PGO.
>
> Representativeness is not the issue I'm raising.  Profiling systems
> (particularly instrumentation based ones) have systemic biases.  Not
> accounting for that can lead to some very odd results.  As an example:
> void foo() {
>   if (member)
>      for(int i = 0; i < 100000; i++)
>        if (member2)
>           bar();
> }
>
> With multiple threads in play, it's entirely possible that the sum of the
> absolute weights on the second branch is lower than the sum of the
> absolute counts on the first branch (i.e. due to racy updating).  While
> you can avoid this by using race-free updates, I know of very few systems
> that actually do.
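The lost-update mechanism behind that concern can be sketched
deterministically, with the thread interleaving written out by hand (a toy
model for illustration, not LLVM's actual instrumentation):

```python
# Toy model: two threads bump one profile counter with a non-atomic
# load/add/store sequence. The interleaving below is the worst case --
# thread B always loads before thread A stores back, so B's store
# clobbers A's update and one increment is lost per round.
counter = 0
rounds = 1000
for _ in range(rounds):
    a = counter      # thread A: load
    b = counter      # thread B: load (before A stores back)
    counter = a + 1  # thread A: store
    counter = b + 1  # thread B: store, overwriting A's increment

logical_increments = 2 * rounds
print(counter)  # 1000: half of the 2000 logical increments were lost
```

When different counters lose different fractions of their updates, the
counts on related branches need not stay mutually consistent, which is how
the anomaly described above can arise.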

Are you speculating, or do you have data to show it? We have large
programs that run with hundreds of threads; race conditions contribute
only very small count variations -- and there are ways to smooth out
the differences.

>
> If your optimization is radically unstable in such scenarios, that's a
> serious problem.  Pessimization is bad enough (if tolerable), incorrect
> transforms are not.

This has never been our experience using PGO in the past.  We also
have tools to compare profile consistency from one training run to
another.

If you experience such problems in real apps, can you file a bug?

> It's very easy to write a transform that implicitly
> assumes the counts for the first branch must be less than the counts for the
> second.

The compiler can detect an insane profile -- it can either ignore it,
correct it, or use it with a warning, depending on options.
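As a rough illustration of what such a sanity check can look like (the
names and tolerance below are invented for this sketch, not LLVM's API):
flow conservation says the counts entering a block should approximately
match the block's own count, so a checker can flag profiles where they
do not.

```python
def profile_is_sane(block_counts, edge_counts, tol=0.05):
    """Check flow conservation: for each block with recorded incoming
    edges, the incoming edge counts should sum to the block's count,
    within a relative tolerance absorbing small racy-update losses."""
    for block, count in block_counts.items():
        preds = [(e, c) for e, c in edge_counts.items() if e[1] == block]
        if not preds:            # entry block: nothing to check
            continue
        incoming = sum(c for _, c in preds)
        if max(incoming, count) == 0:
            continue
        if abs(incoming - count) > tol * max(incoming, count):
            return False
    return True

# A flow-consistent profile passes...
blocks = {"entry": 100, "loop": 1000, "exit": 100}
edges = {("entry", "loop"): 100, ("loop", "loop"): 900,
         ("loop", "exit"): 100}
print(profile_is_sane(blocks, edges))      # True

# ...while an impossible one (exit entered 3x more than it ran) is flagged.
bad_edges = dict(edges)
bad_edges[("loop", "exit")] = 300
print(profile_is_sane(blocks, bad_edges))  # False
```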

>
>>
>>> I'd almost rather see a consistent count inferred from data that's
>>> assumed to be questionable than make the frontend try to generate
>>> consistent profiling metadata.
>>
>> The frontend does not generate profile data -- it is just a messenger
>> that should pass the data faithfully to the middle end. That messenger
>> (the profile reader) can live in the middle end too.
>
> Er, we may be arguing terminology here.  I was including the profiling
> system as part of the "frontend" - I'm working with a JIT - whereas you're
> assuming a separate collection system.  It doesn't actually matter which
> terms we use.  My point was that assuming clean profiling data is just not
> reasonable in practice.  At minimum, some type of normalization step is
> required.

If you are talking about making a slightly inconsistent profile
flow-consistent, yes, there are mechanisms to do that.
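One simple flavor of such a mechanism (a sketch only; production tools use
min-cost-flow-style corrections, and the function name here is invented):
rescale each block's outgoing edge counts so they sum to the block's own
count, preserving the branch ratios the raw data suggests.

```python
def smooth_outgoing(block_count, raw_edge_counts):
    """Rescale raw outgoing edge counts so they sum to block_count,
    keeping their relative proportions. Rounding is naive, so sums can
    be off by one; a real pass would distribute the remainder."""
    total = sum(raw_edge_counts)
    if total == 0:
        # No signal at all: split evenly as an arbitrary fallback.
        even = block_count // len(raw_edge_counts)
        return [even] * len(raw_edge_counts)
    return [round(block_count * c / total) for c in raw_edge_counts]

# Racy updates dropped ~5% of the edge counts; rescaling restores flow
# conservation against the block's own count of 1000.
print(smooth_outgoing(1000, [480, 470]))  # [505, 495]
```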


David

>>
>>
>>> Other than the inliner, can you list the passes you think are
>>> profitable to teach about profiling data?  My list so far is: PRE
>>> (particularly of loads!), the vectorizer (i.e. duplicate work down
>>> both a hot and cold path when it can be vectorized on the hot path),
>>> LoopUnswitch, IRCE, & LoopUnroll (avoiding code size explosion in
>>> cold code).  I'm much more interested in sources of improved
>>> performance than I am simply code size reduction.  (Reducing code
>>> size can improve performance, of course.)
>>
>> PGO is very effective at code size reduction. In reality, a large
>> percentage of functions are globally cold.
>
> For a traditional C++ application, yes.  For a JIT which is only compiling
> warm code paths in hot methods, not so much.  It's still helpful, but the
> impact is much smaller.
>
> Philip



More information about the llvm-dev mailing list