[LLVMdev] Loss of precision with very large branch weights

Xinliang David Li davidxl at google.com
Fri Apr 24 12:51:15 PDT 2015


On Fri, Apr 24, 2015 at 12:40 PM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:
>
>> On 2015-Apr-24, at 11:18, Diego Novillo <dnovillo at google.com> wrote:
>>
>>
>> In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies.  For example, in these two loops:
>>
>> int g = 0;
>> __attribute__((noinline)) void bar() {
>>  g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>>   int i, j, k;
>>
>>   for (i = 0; i < 1000000; i++)
>>     bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>>
>>   for (i = 0; i < 500000; i++)
>>     bar();
>>
>>   printf ("g = %d\n", g);
>> }
>>
>> I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:
>>
>> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
>> Printing analysis 'Block Frequency Analysis' for function 'bar':
>> block-frequency-info: bar
>>  - entry: float = 1.0, int = 8
>>
>> Printing analysis 'Block Frequency Analysis' for function 'main':
>> block-frequency-info: main
>>  - entry: float = 1.0, int = 8
>>  - for.cond: float = 500001.5, int = 4000011
>>  - for.body: float = 500000.5, int = 4000003
>>  - for.inc: float = 500000.5, int = 4000003
>>  - for.end: float = 1.0, int = 8
>>  - for.cond1: float = 250001.5, int = 2000011
>>  - for.body3: float = 250000.5, int = 2000003
>>  - for.inc4: float = 250000.5, int = 2000003
>>  - for.end6: float = 1.0, int = 8
>>
>>
>> But if I manually raise the weights of both loops toward the 32-bit limit, the ratios between the frequencies no longer reflect reality. For example, if I change the branch_weights of the two loops to 4,294,967,295 and 2,147,483,647, respectively:
>
> What does -analyze -branch-prob give?
>
>> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
>> Printing analysis 'Block Frequency Analysis' for function 'bar':
>> block-frequency-info: bar
>>  - entry: float = 1.0, int = 8
>>
>> Printing analysis 'Block Frequency Analysis' for function 'main':
>> block-frequency-info: main
>>  - entry: float = 1.0, int = 8
>>  - for.cond: float = 1073741824.4, int = 8589934595
>>  - for.body: float = 1073741823.4, int = 8589934587
>>  - for.inc: float = 1073741823.4, int = 8589934587
>>  - for.end: float = 1.0, int = 8
>>  - for.cond1: float = 1073741824.4, int = 8589934595
>>  - for.body3: float = 1073741823.4, int = 8589934587
>>  - for.inc4: float = 1073741823.4, int = 8589934587
>>  - for.end6: float = 1.0, int = 8
>>
>> Now both loops are considered equally hot.
>>
>> Duncan, I think that if I made branch_weights a 64-bit integer, this would not be an issue. But I'm not sure whether I'm missing something else here.
>
> I don't understand where the fidelity is being lost; this seems
> strange to me.  4B fits inside UINT32_MAX, so I don't see how we
> hit a cap here.

The max weight is UINT32_MAX / BB->getTerminator()->getNumSuccessors().
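
For reference, the helper in lib/Analysis/BranchProbabilityInfo.cpp is
roughly this (paraphrased, with a comment added):

  static uint32_t getMaxWeightFor(BasicBlock *BB) {
    // Divide by the successor count so that the sum of all outgoing
    // weights of the block still fits in a uint32_t.
    return UINT32_MAX / BB->getTerminator()->getNumSuccessors();
  }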


>
> But certainly there *is* a limit.  We can't store a ratio larger
> than ~4B:1.  Is this really a problem?

Right, this is not the issue. The issue is that when one branch's
weight reaches the cap, the other branch's weight is not scaled
accordingly:

  uint32_t WeightLimit = getMaxWeightFor(BB);
  Weights.reserve(TI->getNumSuccessors());
  for (unsigned i = 1, e = WeightsNode->getNumOperands(); i != e; ++i) {
    ConstantInt *Weight =
        mdconst::dyn_extract<ConstantInt>(WeightsNode->getOperand(i));
    if (!Weight)
      return false;
    // Each weight is clamped against WeightLimit independently.
    Weights.push_back(
        std::max<uint32_t>(1, Weight->getLimitedValue(WeightLimit)));
    // <---- may hit the limit, while the sibling weights are left unscaled
  }
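
To make the loss concrete, here is a small standalone program
(illustrative only, not LLVM code) that mimics the clamping with the
weights from the example above. With two successors the limit is
UINT32_MAX / 2 = 2,147,483,647, and getLimitedValue() acts like a min
against that limit, so both weights collapse to the same value:

  #include <algorithm>
  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint64_t WeightLimit = UINT32_MAX / 2; // two successors
    const uint64_t W1 = 4294967295ULL;           // first loop's weight
    const uint64_t W2 = 2147483647ULL;           // second loop's weight
    // Mimics std::max<uint32_t>(1, Weight->getLimitedValue(WeightLimit)).
    uint32_t C1 = std::max<uint64_t>(1, std::min<uint64_t>(W1, WeightLimit));
    uint32_t C2 = std::max<uint64_t>(1, std::min<uint64_t>(W2, WeightLimit));
    // Prints "2147483647 2147483647" -- the 2:1 ratio is gone.
    printf("%u %u\n", C1, C2);
    return 0;
  }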


> I understand that you can't
> tell which loop is hotter, but why does that matter?  Is there some
> optimization that will do the wrong thing?  Which one?
>

Two calls (in the two loops) that are not equally hot may be treated
as equally hot, confusing the cost/benefit analysis. This is just one
example.

Our goal is to make sure the profile data is kept faithfully. Which
pass consumes it, and whether that happens now or in the future, is a
separate matter.


> Regarding changing branch weights to 64-bits, I think we should be
> sure it's valuable before we change it.  I imagine it's doable, but
> it'll complicate logic in a few places that don't currently have to
> worry about overflowing `uint64_t`.

I think we can fix this without moving to 64-bit weights.
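
A minimal sketch of one way to do that (hypothetical, not an actual
patch): when any weight exceeds the cap, divide *all* of the weights
by a single common factor instead of clamping each one independently,
so the ratios survive:

  #include <algorithm>
  #include <cassert>
  #include <cstdint>
  #include <vector>

  // Hypothetical helper: scale raw 64-bit weights down into
  // [1, WeightLimit] by one common factor, preserving their ratios.
  static std::vector<uint32_t> scaleWeights(const std::vector<uint64_t> &Raw,
                                            uint32_t WeightLimit) {
    assert(!Raw.empty() && "need at least one weight");
    uint64_t MaxWeight = *std::max_element(Raw.begin(), Raw.end());
    // One common divisor for all weights; 1 when MaxWeight is below the limit.
    uint64_t Scale = MaxWeight / WeightLimit + 1;
    std::vector<uint32_t> Scaled;
    Scaled.reserve(Raw.size());
    for (uint64_t W : Raw) {
      // W / Scale < WeightLimit by construction, so the cast is safe;
      // clamp at 1 so no branch becomes impossible.
      uint32_t S = static_cast<uint32_t>(W / Scale);
      Scaled.push_back(std::max<uint32_t>(1, S));
    }
    return Scaled;
  }

With the weights {4294967295, 2147483647} from the example and
WeightLimit = UINT32_MAX / 2, this yields {1431655765, 715827882},
which keeps the 2:1 ratio.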

David


