[LLVMdev] RFC - Improvements to PGO profile support

Tue Mar 10 21:44:37 PDT 2015

> [ivan] What new data type do you suggest? Increasing it to 64 bit for
> 64-bit architectures would be reasonable, and would reduce the need of
> scaling counts.

64 bit int in MD_prof.

>
>>
>>
>>> The first issue raised is that profile execution counts are not
>>> represented in the IR. This was a very intentional decision. I know it
>>> goes
>>> against what other compilers have done in the past. It took me a while
>>> to
>>> get used to the idea when Andy first suggested it, so I know it seems
>>> awkward at first. The advantage is that branch probabilities are much
>>> easier to keep updated in the face of compiler transformations, compared
>>> to
>>> execution counts.
>>>
>>
>> Sorry. I don't follow. Updating counts as the CFG is transformed is not
>> difficult at all. What examples do you have in mind?  The big advantage of
>> making MD_prof an actual execution count is that it is a meaningful metric
>> wrt scaling and transformation.
>>
>> Say, for instance, that we have a branch instruction with two targets with
>> counts {100, 300} inside a function 'foo' that has entry count 2. The edge
>> probability for the first edge (count 100) is 100/(100+300) = 25%.
>>
>> If we inline foo() inside another function bar() at a callsite with
>> profile
>> count == 1, the cloned branch instruction gets its counters scaled with
>> the
>> callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2} =
>> {50, 150}.  But the branch probability did not change. Currently, we are
>> cloning the branch without changing the edge weights.
>>
>> This scaling is not difficult at all and can be incrementally very
>> quickly.
>> We cannot afford to recompute all frequencies on the fly because it would
>> be detrimental to compile time. If foo() itself has N callees inlined into
>> it, each inlined callee needs to trigger a re-computation. When foo() is
>> inlined into bar(), the frequencies will need to be recomputed for foo()
>> and all N callees inlined into foo().
>>
>
> [ivan] Scaling branch counts when inlining looks good, but the current PGO
> support doesn't maintain function entry counts at LLVM IR. How do you plan
> to maintain these at LLVM IR?

MD_prof is the only thing that needs to be maintained at IR level. For
in-memory CFG profile representation, function entry count is needed.

This means when CFG profile data is incrementally updated during
inlining, the MD_profile weight/count for the cloned branches needs to
be scaled in the same way.

>
> As shown above, the three calls to bar() get float BBFreq of 1000000 (in
> for.body6), 409559.0441 (in for.body18), and 4095.995904 (in for.body28).
>
> When the example is modified for trip counts of 10, 100, and 1000
>> opt -analyze -block-freq test1.10.ll
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 11.0, int = 87
>  - for.cond1: float = 110.0, int = 879
>  - for.cond4: float = 1100.0, int = 8799
>  - for.body6: float = 1000.0, int = 7999
>  - for.inc7: float = 100.0, int = 799
>  - for.inc10: float = 10.0, int = 79
>  - for.end12: float = 1.0, int = 8
>  - for.cond13: float = 11.0, int = 87
>  - for.cond16: float = 1010.0, int = 8079
>  - for.body18: float = 1000.0, int = 7999
>  - for.inc22: float = 10.0, int = 79
>  - for.end24: float = 1.0, int = 8
>  - for.cond26: float = 1001.0, int = 8007
>  - for.body28: float = 1000.0, int = 7999
>  - for.end31: float = 1.0, int = 8
>
> the three calls to bar() get the correct count of 1000.
> We should try to remove the limitation of 4096 in the block frequency
> propagation algorithm if possible.

4096 is needed for static profile estimation -- there is no need to
remove the limit when real profile data is available -- the solution
is to simply not use the frequency propagation at all.

For your example,  the values 11, 1010 etc are all bogus.  When loop
trip count is small, the ratio of loop body count and preheader count
(average of trip count) can be very different from the real trip count
(see PR22719).  In the same bug, there is another example (multi-entry
loop) showing the weakness.

Storing profile count in MD_prof allows the compiler to skip the freq
propagation completely. It also allows faster incremental update
during inlining via simple scaling.

David

>
> Ivan
>
>