[LLVMdev] Loss of precision with very large branch weights

Duncan P. N. Exon Smith dexonsmith at apple.com
Fri Apr 24 12:40:41 PDT 2015


> On 2015-Apr-24, at 11:18, Diego Novillo <dnovillo at google.com> wrote:
> 
> 
> In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies.  For example, consider these two loops:
> 
> int g = 0;
> __attribute__((noinline)) void bar() {
>  g++;
> }
> 
> extern int printf(const char*, ...);
> 
> int main()
> {
>   int i, j, k;
> 
>   for (i = 0; i < 1000000; i++)
>     bar();
> 
>   printf ("g = %d\n", g);
>   g = 0;
> 
>   for (i = 0; i < 500000; i++)
>     bar();
> 
>   printf ("g = %d\n", g);
> }
> 
> I expect the loop-body frequency of the second loop to be about half that of the first. This holds for this test case:
> 
> $ bin/opt -analyze -block-freq -S unbiased-branches.ll 
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
> 
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 500001.5, int = 4000011
>  - for.body: float = 500000.5, int = 4000003
>  - for.inc: float = 500000.5, int = 4000003
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 250001.5, int = 2000011
>  - for.body3: float = 250000.5, int = 2000003
>  - for.inc4: float = 250000.5, int = 2000003
>  - for.end6: float = 1.0, int = 8
> 
> 
> But if I manually modify the frequencies of both loops to get close to the 32-bit limit, the ratio between the frequencies no longer reflects reality. For example, if I change branch_weights in both loops to 4,294,967,295 and 2,147,483,647.
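> 
> (A minimal sketch of that hand-edit, assuming the block names from the dump above; the metadata id !0 is illustrative:)
> 
> for.cond:                            ; preds = %for.inc, %entry
>   ...
>   br i1 %cmp, label %for.body, label %for.end, !prof !0
> 
> !0 = !{!"branch_weights", i32 4294967295, i32 1}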

What does -analyze -branch-prob give?

> $ bin/opt -analyze -block-freq -S unbiased-branches.ll 
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
> 
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 1073741824.4, int = 8589934595
>  - for.body: float = 1073741823.4, int = 8589934587
>  - for.inc: float = 1073741823.4, int = 8589934587
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 1073741824.4, int = 8589934595
>  - for.body3: float = 1073741823.4, int = 8589934587
>  - for.inc4: float = 1073741823.4, int = 8589934587
>  - for.end6: float = 1.0, int = 8
> 
> Now both loops are considered equally hot.
> 
> Duncan, I think that if I made branch_weights a 64-bit integer, this would not be an issue, but I'm not sure whether I'm missing something else here.

I don't understand where the fidelity is being lost; this seems
strange to me.  A weight of 4,294,967,295 still fits in a uint32_t,
so I don't see how we hit a cap here.

But certainly there *is* a limit.  We can't store a ratio larger
than ~4B:1.  Is this really a problem?  I understand that you can't
tell which loop is hotter, but why does that matter?  Is there some
optimization that will do the wrong thing?  Which one?
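
To be concrete about that limit: since each weight is a 32-bit
value, the largest back-edge-to-exit ratio we can express is about
4B:1, and any real count above that saturates.  A sketch of the
collapse (plain C, illustration only, not LLVM code):

#include <stdint.h>
#include <stdio.h>

/* Any execution count above the 32-bit limit saturates to the same
 * weight, so two loops with very different real counts become
 * indistinguishable. */
static uint32_t clamp_to_u32(uint64_t real_count) {
  return real_count > UINT32_MAX ? UINT32_MAX : (uint32_t)real_count;
}

int main(void) {
  uint64_t hot  = 8589934590ULL;  /* twice the 32-bit limit   */
  uint64_t warm = 4294967295ULL;  /* exactly the 32-bit limit */
  printf("hot  -> %u\n", clamp_to_u32(hot));   /* 4294967295 */
  printf("warm -> %u\n", clamp_to_u32(warm));  /* 4294967295 */
  return 0;
}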

Regarding changing branch weights to 64 bits, I think we should be
sure it's valuable before we change it.  I imagine it's doable, but
it'll complicate logic in a few places that don't currently have to
worry about overflowing `uint64_t`.
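
Sketching the kind of issue widening would introduce (plain C,
illustration only, not LLVM code): summing 32-bit weights into a
uint64_t accumulator can never overflow, but with 64-bit weights
even two terms can wrap, so those sums would need saturation or
128-bit arithmetic.

#include <stdint.h>

/* With 32-bit weights, even 2^32 terms of UINT32_MAX sum to less
 * than 2^64, so a plain uint64_t accumulator is always safe. */
uint64_t sum_weights_u32(const uint32_t *w, uint64_t n) {
  uint64_t sum = 0;
  for (uint64_t i = 0; i < n; ++i)
    sum += w[i];                  /* cannot overflow for n <= 2^32 */
  return sum;
}

/* With 64-bit weights, the sum itself can overflow, so every
 * addition has to check and saturate (or use 128-bit math). */
uint64_t sum_weights_u64(const uint64_t *w, uint64_t n) {
  uint64_t sum = 0;
  for (uint64_t i = 0; i < n; ++i) {
    if (w[i] > UINT64_MAX - sum)
      return UINT64_MAX;          /* saturate instead of wrapping */
    sum += w[i];
  }
  return sum;
}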


