[PATCH] D12603: Use fixed-point representation for BranchProbability

Sat Sep 12 15:02:42 PDT 2015

On Fri, Sep 11, 2015 at 5:34 PM, Duncan P. N. Exon Smith <
dexonsmith at apple.com> wrote:

>
> > On 2015-Sep-10, at 14:15, Xinliang David Li <davidxl at google.com> wrote:
> >
> > Cong,  It is more common to use power of 2 as the scaling factor in
> > fixed point representation for computation efficiency.  For 32bit
> > width, 0x80000000 is common (noted as Q31).  Here are the benefits
> >
> > 1) it is much faster to do the scale operation with Q31 -- instead of
> > using the current slow version of scale, it can call a faster version
> > to be introduced.
> >
> > This will probably help resolve bug
> > https://llvm.org/bugs/show_bug.cgi?id=24620 where 20% of the time is
> > spent in the slow version of the scale method.
>
> Is there a way to speed up scaling with UINT32_MAX?  I suspect so, but
> haven't worked it out.  (E.g., you could scale to UINT64_C(1) << 32 and
> ignore the first dither; if there's no dithering, subtract 1 from the
> last.)
>

The slowdown does not come from scaling  that forms the fixed point value,
but from:

static uint64_t scale(uint64_t Num, uint32_t N, uint32_t D) {
  ....
}

where the main contributor is the DIV/REM operations used in this method.

My profiling data shows that if DIV/REM operations are  replaced with
shift/bitwise_and operation (when D is power of 2), the runtime of the
method can be sped up by 6X.

If D is UINT32_MAX and if the constant value is actually exposed to the
compiler (manually or through const prop and cloning which is done by GCC,
not Clang), the instructor selector will convert the DIV/REM into IMUL and
SHR, which can also get lots of speedup, but compared with the power of 2
case, this version of 'scale' is still 1.6X slower.

In other words, in terms of computation efficiency, Power of 2 Denominator
has huge advantage, the implementation of Scale is also more readable.

>
> > 2) it allows negative fractional number to be represented
>
> Not relevant for probabilities, though.
>
> > 3) it is more precise for lots of common cases. For instance,
> > representation of 0.5, 0.25, 0.75 (i.e. 1/2, 1/4, 3/4) etc are
> > actually lossless with Q31 while the current choice of using
> > uint32_max can not represent them precisely.
>
> It's less precise for lots of common cases, too.  2^32's only prime
> factor is 2, but 2^32-1 has 3, 5, 17, etc.
>
> This argument seems weak either way.
>

I have done some experiments to compare the precision of two choices.  Here
is how the experiment is set up:

1) first collect a set of unique fractional numbers N/D, where D increases
from 2 to 100, and N increases from 1 to D-1. Skip N/D if GCD(D,N) is not
1.  The total number of ratios is 3403.

2)
     for any each value V in set [100, 1000000]
        for each (N,D) pair collected in 1)
             2.1) compute the reference scaled value for V
                 scale_val_ref =  scale(V, N, D)
             2.2)  compute the fixed point rep for N/D:
                  FixedPoint<SF> fp (N, D);
             2.3) compute scaled value  for V:
                  scale_val = scale(V, fp.N, SF)

             2.4) compare scale_val and scale_val_ref and collect statistics

In the above code, SF = 0x80000000 and UINT32_MAX

In step 2.2) above, there are total ~3.4 billion scaled value computed. The
experiment will collect the total number of scaling operations where
scale_val != scale_val_ref for each SF. The max diff of scale_val, and
scal_val_ref is also recorded.

Here is the result. In both cases, the max diff is 1.
When SF = 0x80000000, there are about 0.88% of total scaling operations
that produce different result from the reference value;
When SF = UINT32_MAX, there are about 0.93% of total scaling results that
differ from the references.

In short, the difference between the above two cases is 0.05%

Given the small precision difference, I tend to believe 0x80000000 is
slightly better (given its efficiency and simplicity for div/rem ops).

Regarding debugging print, you mentioned using 0xffffffff as denominator is
easier to track. Is there a reason for that? I find large numbers are
equally hard to track mentally.

thanks,

David

>
> >
> > (if needed in the future, we can also extend this to be a template
> > class where number of bits in the scaling factor is a template
> > parameter to support values > 1
> >   template <int bits> class FixedPoint {
> >     static const int ScalingFactor = 1 << bits;
> > };
> > )
>
> Sure, applies to either.
>
> >
> > In the IR dump, I find it is easier to read
> > 1) percentage,
> > 2) M/N form when N is less than 10.
>
> I think we should always print (and check) for a percentage, but if you're
> debugging something, you need to see exactly what's in memory, and the
> hex is precise and easy to reason about.
>
> >
> > Current dump of 0x10ffff/0xffffffff form is hard to read -- can then
> > be simplified to make the denominator < 10?
> >
> >
> > Duncan, what is your opinion?
>
> If I'm going to be tracking down some bug by looking at DEBUG prints, I
> suspect I'll find 0xffffffff easier to reason about than 0x80000000.  So
> if we can keep UINT32_MAX, I'd prefer that.
>
> >
> > thanks,
> >
> > David
> >
> >
> >
> >
> >
> > On Thu, Sep 10, 2015 at 12:21 PM, Cong Hou <congh at google.com> wrote:
> >> congh updated this revision to Diff 34474.
> >> congh added a comment.
> >>
> >> Update the patch by fixing block frequency / branch probability unit
> tests failures.
> >>
> >> I am not sure if this is the right way to update the unit tests. The
> fixed-point representation has precision issues comparing to the previous
> rational representation and that is why we need to update the comparison
> right answers in unit tests.
> >>
> >>
> >> http://reviews.llvm.org/D12603
> >>
> >> Files:
> >>  include/llvm/Analysis/BlockFrequencyInfoImpl.h
> >>  include/llvm/Support/BranchProbability.h
> >>  lib/Support/BranchProbability.cpp
> >>  lib/Target/ARM/ARMBaseInstrInfo.cpp
> >>  test/Analysis/BlockFrequencyInfo/basic.ll
> >>  test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
> >>  test/Analysis/BranchProbabilityInfo/basic.ll
> >>  test/Analysis/BranchProbabilityInfo/loop.ll
> >>  test/Analysis/BranchProbabilityInfo/noreturn.ll
> >>  test/Analysis/BranchProbabilityInfo/pr18705.ll
> >>  test/Analysis/BranchProbabilityInfo/pr22718.ll
> >>  test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
> >>  test/CodeGen/ARM/2013-10-11-select-stalls.ll
> >>  test/CodeGen/ARM/ifcvt4.ll
> >>  test/CodeGen/ARM/sjlj-prepare-critical-edge.ll
> >>  test/CodeGen/ARM/tail-merge-branch-weight.ll
> >>  test/CodeGen/ARM/test-sharedidx.ll
> >>  test/CodeGen/Thumb2/thumb2-ifcvt1.ll
> >>  test/Transforms/SampleProfile/branch.ll
> >>  test/Transforms/SampleProfile/calls.ll
> >>  test/Transforms/SampleProfile/discriminator.ll
> >>  test/Transforms/SampleProfile/fnptr.ll
> >>  test/Transforms/SampleProfile/propagate.ll
> >>  unittests/Support/BlockFrequencyTest.cpp
> >>  unittests/Support/BranchProbabilityTest.cpp
> >>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150912/c06fac1e/attachment.html>