[LLVMdev] RFC - Improvements to PGO profile support

Ivan Baev ibaev at codeaurora.org
Tue Mar 10 18:30:20 PDT 2015


Hi Diego,

I work for Qualcomm and we are also interested in improvements to PGO
profile support, especially for supporting relative hotness across
functions. We are willing to contribute in this area, so please keep me in
this discussion and in the work distribution. Some questions and comments
below.


> Date: Tue, 10 Mar 2015 13:14:22 -0400
> From: Diego Novillo <dnovillo at google.com>
> To: Bob Wilson <bob.wilson at apple.com>
> Cc: Xinliang David Li <davidxl at google.com>,	LLVM Developers Mailing
> 	List <llvmdev at cs.uiuc.edu>
> Subject: Re: [LLVMdev] RFC - Improvements to PGO profile support
> Message-ID:
> 	<CAD_=9DSzNSqpku0FjpeLyugaxshcxY9fLyyZ5EmD=i5TVDiqRA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>
>>
>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>>
>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com>
>> wrote:
>>
>>> I've created a few bugzilla issues with details of some of the things
>>> I'll be looking into. I'm not yet done wordsmithing the overall design
>>> document. I'll try to finish it by early next week at the latest.
>>>
>>
>> The document is available at
>>
>>
>> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>
>> There are several topics covered. Ideally, I would prefer that we
>> discuss
>> each topic separately. The main ones I will start working on are the
>> ones
>> described in the bugzilla links we have in the doc.
>>
>> This is just a starting point for us. I am not at all concerned with
>> implementing exactly what is proposed in the document. In fact, if we
>> can
>> get the same value using the existing support, all the better.
>>
>> OTOH, any other ideas that folks may have that work better than this are
>> more than welcome. I don't have really strong opinions on the matter. I
>> am
>> fine with whatever works.
>>
>>
>> Thanks for the detailed write-up on this. Some of the issues definitely
>> need to be addressed. I am concerned, though, that some of the ideas may
>> be
>> leading toward a scenario where we have essentially two completely
>> different ways of representing profile information in LLVM IR. It is
>> great
>> to have two complementary approaches to collecting profile data, but two
>> representations in the IR would not make sense.
>>
>
> Yeah, I don't think I'll continue to push for a new MD_count attribute. If
> we were to make MD_prof be a "real" execution count, that would be enough.
> Note that by re-using MD_prof we are not changing its meaning at all. The
> execution count is still a weight and the ratio is still branch
> probability. All that we are changing are the absolute values of the
> number
> and increasing its data type width to remove the 32bit limitation.
>

[ivan] What new data type do you suggest? Increasing it to 64 bits for
64-bit architectures would be reasonable, and would reduce the need for
scaling counts.
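To make the 32-bit limitation concrete, here is a minimal sketch (the
helper name and the example counts are mine, purely illustrative):

```python
UINT32_MAX = 2**32 - 1  # the current 32-bit limit on MD_prof weights

def needs_scaling(count, limit=UINT32_MAX):
    """Return True if a raw execution count exceeds the representable
    limit and would therefore have to be scaled down, losing precision."""
    return count > limit

# A hot call site executed ~10 billion times overflows a 32-bit counter:
assert needs_scaling(10_000_000_000)
# With a 64-bit weight the same count fits and no scaling is needed:
assert not needs_scaling(10_000_000_000, limit=2**64 - 1)
```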

>
>
>> The first issue raised is that profile execution counts are not
>> represented in the IR. This was a very intentional decision. I know it
>> goes
>> against what other compilers have done in the past. It took me a while
>> to
>> get used to the idea when Andy first suggested it, so I know it seems
>> awkward at first. The advantage is that branch probabilities are much
>> easier to keep updated in the face of compiler transformations, compared
>> to
>> execution counts.
>>
>
> Sorry. I don't follow. Updating counts as the CFG is transformed is not
> difficult at all. What examples do you have in mind?  The big advantage of
> making MD_prof an actual execution count is that it is a meaningful metric
> wrt scaling and transformation.
>
> Say, for instance, that we have a branch instruction with two targets with
> counts {100, 300} inside a function 'foo' that has entry count 2. The edge
> probability for the first edge (count 100) is 100/(100+300) = 25%.
>
> If we inline foo() inside another function bar() at a callsite with
> profile
> count == 1, the cloned branch instruction gets its counters scaled with
> the
> callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2} =
> {50, 150}.  But the branch probability did not change. Currently, we are
> cloning the branch without changing the edge weights.
>
> This scaling is not difficult at all and can be done incrementally very
> quickly.
> We cannot afford to recompute all frequencies on the fly because it would
> be detrimental to compile time. If foo() itself has N callees inlined into
> it, each inlined callee needs to trigger a re-computation. When foo() is
> inlined into bar(), the frequencies will need to be recomputed for foo()
> and all N callees inlined into foo().
>

[ivan] Scaling branch counts when inlining looks good, but the current PGO
support doesn't maintain function entry counts in the LLVM IR. How do you
plan to maintain these in the IR?
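For reference, the scaling described above is straightforward to express;
a minimal sketch (the function names are mine, not from the proposal):

```python
def scale_branch_counts(branch_counts, callsite_count, callee_entry_count):
    """Scale a cloned branch's counts by callsite_count / callee_entry_count,
    as done when inlining the callee at a profiled call site."""
    return [c * callsite_count // callee_entry_count for c in branch_counts]

def branch_probabilities(branch_counts):
    """Branch probabilities are the ratios of the counts; they are
    invariant under the scaling above."""
    total = sum(branch_counts)
    return [c / total for c in branch_counts]

# foo() has entry count 2 and a branch with counts {100, 300};
# it is inlined into bar() at a call site with profile count 1:
scaled = scale_branch_counts([100, 300], callsite_count=1,
                             callee_entry_count=2)
assert scaled == [50, 150]
# The edge probabilities are unchanged: still 25% / 75%.
assert branch_probabilities(scaled) == branch_probabilities([100, 300])
```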

>
>> We are definitely missing the per-function execution counts that are
>> needed to be able to compare relative "hotness" across functions, and I
>> think that would be a good place to start making improvements. In the
>> long
>> term, we should keep our options open to making major changes, but
>> before
>> we go there, we should try to make incremental improvements to fix the
>> existing infrastructure.
>>
>
> Right, and that's the core of our proposal. We don't really want to make
> major infrastructure changes at this point. The only thing I'd like to
> explore is making MD_prof a real count. This will be useful for the
> inliner
> changes and it would also make incremental updates easier, because the
> scaling that needs to be done is very straightforward and quick.
>
> Note that this change should not modify the current behaviour we get from
> profile analysis. Things that were hot before should continue to be hot
> now.
>
>
>> Many of the other issues you raise seem like they could also be
>> addressed
>> without major changes to the existing infrastructure. Let's try to fix
>> those first.
>>
>
> That's exactly the point of the proposal.  We definitely don't want to
> make
> major changes to the infrastructure at first. My thinking is to start
> working on making MD_prof a real count. One of the things that is
> happening is that the combination of real profile data plus the frequency
> propagation that we are currently doing is misleading the analysis.
>
> For example (thanks David for the code and data). In the following code:
>
> int g;
> __attribute__((noinline)) void bar() {
>  g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   g = 0;
>
>   // Loop 1.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 100; j++)
>        for (k = 0; k < 100; k++)
>            bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   // Loop 2.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 10000; j++)
>         bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>
>   // Loop 3.
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
> }
>
> When compiled with profile instrumentation, frequency propagation is
> distorting the real profile because it gives different frequency to the
> calls to bar() in the 3 different loops. All 3 loops execute 1,000,000
> times, but after frequency propagation, the first call to bar() gets a
> weight of 520,202 in loop #1, 210,944 in loop #2, and 4,096 in loop #3. In
> reality, every call to bar() should have a weight of 1,000,000.

[ivan] I ran the example with
> opt -analyze -block-freq test1.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
 - entry: float = 1.0, int = 8
 - for.cond: float = 101.0, int = 807
 - for.cond1: float = 10100.0, int = 80799
 - for.cond4: float = 1010000.0, int = 8079999
 - for.body6: float = 1000000.0, int = 7999999
 - for.inc7: float = 10000.0, int = 79999
 - for.inc10: float = 100.0, int = 799
 - for.end12: float = 1.0, int = 8
 - for.cond13: float = 101.0, int = 807
 - for.cond16: float = 409600.0, int = 3276799
 - for.body18: float = 409559.0441, int = 3276472
 - for.inc22: float = 100.0, int = 799
 - for.end24: float = 1.0, int = 8
 - for.cond26: float = 4096.0, int = 32768
 - for.body28: float = 4095.995904, int = 32767
 - for.end31: float = 1.0, int = 8

As shown above, the three calls to bar() get float BBFreq of 1000000 (in
for.body6), 409559.0441 (in for.body18), and 4095.995904 (in for.body28).

When the example is modified for trip counts of 10, 100, and 1000
> opt -analyze -block-freq test1.10.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
 - entry: float = 1.0, int = 8
 - for.cond: float = 11.0, int = 87
 - for.cond1: float = 110.0, int = 879
 - for.cond4: float = 1100.0, int = 8799
 - for.body6: float = 1000.0, int = 7999
 - for.inc7: float = 100.0, int = 799
 - for.inc10: float = 10.0, int = 79
 - for.end12: float = 1.0, int = 8
 - for.cond13: float = 11.0, int = 87
 - for.cond16: float = 1010.0, int = 8079
 - for.body18: float = 1000.0, int = 7999
 - for.inc22: float = 10.0, int = 79
 - for.end24: float = 1.0, int = 8
 - for.cond26: float = 1001.0, int = 8007
 - for.body28: float = 1000.0, int = 7999
 - for.end31: float = 1.0, int = 8

the three calls to bar() get the correct count of 1000. We should try to
remove the 4096 limitation in the block frequency propagation algorithm if
possible.
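The distorted numbers above are consistent with each loop's scale being
capped individually; a small sketch of that behavior (the cap value of
4096 is inferred from the output above, not taken from the implementation):

```python
LOOP_SCALE_CAP = 4096  # per-loop cap inferred from the block-freq output

def capped_block_freq(entry_freq, trip_counts):
    """Approximate a block's frequency as the product of its enclosing
    loops' trip counts, with each loop's scale capped individually."""
    freq = entry_freq
    for trips in trip_counts:
        freq *= min(trips, LOOP_SCALE_CAP)
    return freq

# Loop 1: 100 x 100 x 100 -- each scale is under the cap, so the
# propagated frequency matches the real count of 1,000,000.
assert capped_block_freq(1, [100, 100, 100]) == 1000000
# Loop 2: 100 x 10000 -- the inner loop's scale is capped at 4096.
assert capped_block_freq(1, [100, 10000]) == 409600
# Loop 3: a single loop of 1,000,000 iterations is capped at 4096.
assert capped_block_freq(1, [1000000]) == 4096
```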

Ivan




