[PATCH] D12603: Use fixed-point representation for BranchProbability
Xinliang David Li via llvm-commits
llvm-commits at lists.llvm.org
Mon Sep 21 14:16:07 PDT 2015
On Mon, Sep 21, 2015 at 2:04 PM, Duncan P. N. Exon Smith <
dexonsmith at apple.com> wrote:
>
> > On 2015-Sep-12, at 15:02, Xinliang David Li <davidxl at google.com> wrote:
> >
> >
> >
> > On Fri, Sep 11, 2015 at 5:34 PM, Duncan P. N. Exon Smith <
> dexonsmith at apple.com> wrote:
> >
> > > On 2015-Sep-10, at 14:15, Xinliang David Li <davidxl at google.com>
> wrote:
> > >
> > > Cong, it is more common to use a power of 2 as the scaling factor in
> > > a fixed-point representation, for computation efficiency. For 32-bit
> > > width, 0x80000000 is common (known as Q31). Here are the benefits:
> > >
> > > 1) It is much faster to do the scale operation with Q31 -- instead of
> > > using the current slow version of scale, we can call a faster version
> > > (to be introduced).
> > >
> > > This will probably help resolve bug
> > > https://llvm.org/bugs/show_bug.cgi?id=24620 where 20% of the time is
> > > spent in the slow version of the scale method.
> >
> > Is there a way to speed up scaling with UINT32_MAX? I suspect so, but
> > haven't worked it out. (E.g., you could scale to UINT64_C(1) << 32 and
> > ignore the first dither; if there's no dithering, subtract 1 from the
> > last.)
> >
> > The slowdown does not come from the scaling that forms the fixed-point
> > value, but from:
> >
> > static uint64_t scale(uint64_t Num, uint32_t N, uint32_t D) {
> > ....
> > }
> >
> > where the main contributor is the DIV/REM operations used in this method.
> >
> > My profiling data shows that if the DIV/REM operations are replaced with
> > shift/bitwise-AND operations (when D is a power of 2), the runtime of the
> > method can be sped up by 6x.
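> >
> > For reference, here is a minimal sketch (not the actual LLVM code) of
> > what the power-of-2 fast path can look like, assuming the compiler
> > provides unsigned __int128 (GCC/Clang on 64-bit hosts) and truncating
> > rounding:
> >
> >   #include <cstdint>
> >
> >   // The probability is N / 2^31 (Q31, so N <= 2^31). The generic
> >   // scale() needs a 96-bit DIV/REM by D; with D == 2^31 the division
> >   // collapses into a single right shift.
> >   static uint64_t scaleQ31(uint64_t Num, uint32_t N) {
> >     unsigned __int128 Product = (unsigned __int128)Num * N;
> >     return (uint64_t)(Product >> 31); // truncating divide by 2^31
> >   }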
> >
> > If D is UINT32_MAX and the constant value is actually exposed to the
> > compiler (manually, or through const prop and cloning, which is done by
> > GCC but not Clang), the instruction selector will convert the DIV/REM
> > into IMUL and SHR, which also gives a lot of speedup; but compared with
> > the power-of-2 case, this version of 'scale' is still 1.6x slower.
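> >
> > (For illustration only -- not the exact sequence the selector emits --
> > division by this particular constant can also be strength-reduced by
> > hand, which shows why exposing the constant helps:)
> >
> >   // Hand-rolled floor(n / (2^32 - 1)). Writing
> >   // n = (n >> 32) * 2^32 + (n & 0xFFFFFFFF) and substituting
> >   // 2^32 = (2^32 - 1) + 1 gives quotient (n >> 32) plus a remainder
> >   // term t < 2^33 that needs at most two fix-up subtractions.
> >   static uint64_t divByUint32Max(uint64_t n) {
> >     uint64_t q = n >> 32;
> >     uint64_t t = (n & 0xFFFFFFFF) + q;
> >     while (t >= UINT32_MAX) { t -= UINT32_MAX; ++q; }
> >     return q;
> >   }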
> >
> > In other words, in terms of computation efficiency, a power-of-2
> > denominator has a huge advantage, and the implementation of scale is
> > also more readable.
> >
> >
> >
> > > 2) it allows negative fractional numbers to be represented
> >
> > Not relevant for probabilities, though.
> >
> > > 3) It is more precise for lots of common cases. For instance, the
> > > representations of 0.5, 0.25, 0.75 (i.e. 1/2, 1/4, 3/4), etc. are
> > > actually lossless with Q31, while the current choice of UINT32_MAX
> > > cannot represent them precisely.
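> > > For example, 3/4 scales to exactly 0x60000000 in Q31 (3 * 2^31 / 4 =
> > > 1610612736), while 3 * (2^32 - 1) / 4 = 3221225471.25 is not an
> > > integer and must be rounded.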
> >
> > It's less precise for lots of common cases, too. 2^32's only prime
> > factor is 2, but 2^32-1 has 3, 5, 17, etc.
> >
> > This argument seems weak either way.
> >
> > I have done some experiments to compare the precision of the two
> > choices. Here is how the experiment is set up:
> >
> > 1) First, collect a set of unique fractions N/D, where D increases
> > from 2 to 100 and N increases from 1 to D-1, skipping N/D if
> > GCD(N, D) is not 1. The total number of ratios is 3403.
> >
> > 2) For each value V in [100, 1000000]:
> >      for each (N, D) pair collected in 1):
> >        2.1) compute the reference scaled value for V:
> >               scale_val_ref = scale(V, N, D)
> >        2.2) compute the fixed-point representation of N/D:
> >               FixedPoint<SF> fp(N, D);
> >        2.3) compute the scaled value for V:
> >               scale_val = scale(V, fp.N, SF)
> >        2.4) compare scale_val and scale_val_ref and collect statistics
> >
> > In the above, SF is either 0x80000000 or UINT32_MAX.
> >
> > In step 2.3) above, a total of ~3.4 billion scaled values are computed.
> > For each SF, the experiment collects the total number of scaling
> > operations where scale_val != scale_val_ref. The max diff between
> > scale_val and scale_val_ref is also recorded.
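> >
> > (A rough, self-contained C++ sketch of the experiment is below. The
> > round-to-nearest behavior when forming the fixed-point numerator and
> > when scaling is my assumption, so exact counts may differ slightly:)
> >
> >   #include <cstdint>
> >   #include <cstdio>
> >   #include <numeric> // std::gcd, C++17
> >
> >   // Round-to-nearest scale of V by N/D, in 128-bit arithmetic.
> >   static uint64_t scaleRef(uint64_t V, uint64_t N, uint64_t D) {
> >     return (uint64_t)(((unsigned __int128)V * N + D / 2) / D);
> >   }
> >
> >   int main() {
> >     const uint64_t SF = UINT64_C(0x80000000); // or UINT32_MAX
> >     uint64_t Mismatch = 0, Total = 0;
> >     for (uint32_t D = 2; D <= 100; ++D)
> >       for (uint32_t N = 1; N < D; ++N) {
> >         if (std::gcd(N, D) != 1)
> >           continue;
> >         uint64_t FpN = scaleRef(SF, N, D); // fixed-point rep of N/D
> >         for (uint64_t V = 100; V <= 1000000; ++V) {
> >           ++Total;
> >           Mismatch += scaleRef(V, FpN, SF) != scaleRef(V, N, D);
> >         }
> >       }
> >     printf("%llu mismatches out of %llu scaling operations\n",
> >            (unsigned long long)Mismatch, (unsigned long long)Total);
> >     return 0;
> >   }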
> >
> > Here is the result. In both cases, the max diff is 1.
> > When SF = 0x80000000, about 0.88% of the scaling operations produce a
> > result that differs from the reference value;
> > when SF = UINT32_MAX, about 0.93% of the scaling results differ from
> > the reference.
> >
> > In short, the difference between the above two cases is 0.05%
>
> This looks like 7% exact vs. 12% exact, which is almost double for this
> data set. Fairly big difference.
>
99.07% vs 99.12% exact, right?
>
> Although, the idea that adding almost an entire extra bit could make
> precision *worse* is kind of strange. I'm a little skeptical that this
> calculation is useful.
>
> > Given the small precision difference, I tend to believe 0x80000000 is
> slightly better (given its efficiency and simplicity for div/rem ops).
> >
> > Regarding the debug printing, you mentioned that using 0xffffffff as the
> > denominator is easier to track. Is there a reason for that? I find
> > large numbers equally hard to track mentally.
>
> I find that there's less "carrying" when adding numbers together.
> Anecdotally, when adding a bunch of numbers together, it's "having to
> carry digits" that most increases the likelihood I'll have to take
> notes.
>
> In the end this probably doesn't matter too much. I'm fine either way.
>
Thanks! Just to be clear: you are fine with (1 << 31) as the scaling
factor, right?
David
>
> >
> > thanks,
> >
> > David
> >
> >
> > >
> > > (if needed in the future, we can also extend this to be a template
> > > class where the number of bits in the scaling factor is a template
> > > parameter, to support values > 1:
> > >
> > >   template <unsigned Bits> class FixedPoint {
> > >     // uint64_t and UINT64_C(1) avoid signed overflow when Bits >= 31.
> > >     static const uint64_t ScalingFactor = UINT64_C(1) << Bits;
> > >   };
> > > )
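> > > (e.g. FixedPoint<31>::ScalingFactor would be 0x80000000)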
> >
> > Sure, applies to either.
> >
> > >
> > > In the IR dump, I find it easier to read
> > > 1) a percentage,
> > > 2) the M/N form when N is less than 10.
> >
> > I think we should always print (and check) for a percentage, but if
> you're
> > debugging something, you need to see exactly what's in memory, and the
> > hex is precise and easy to reason about.
> >
> > >
> > > The current dump of the 0x10ffff/0xffffffff form is hard to read --
> > > can it be simplified to make the denominator < 10?
> > >
> > >
> > > Duncan, what is your opinion?
> >
> > If I'm going to be tracking down some bug by looking at DEBUG prints, I
> > suspect I'll find 0xffffffff easier to reason about than 0x80000000. So
> > if we can keep UINT32_MAX, I'd prefer that.
> >
> > >
> > > thanks,
> > >
> > > David
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Sep 10, 2015 at 12:21 PM, Cong Hou <congh at google.com> wrote:
> > >> congh updated this revision to Diff 34474.
> > >> congh added a comment.
> > >>
> > >> Update the patch by fixing block frequency / branch probability unit
> > >> test failures.
> > >>
> > >> I am not sure if this is the right way to update the unit tests. The
> > >> fixed-point representation has precision issues compared to the
> > >> previous rational representation, and that is why we need to update
> > >> the expected answers in the unit tests.
> > >>
> > >>
> > >> http://reviews.llvm.org/D12603
> > >>
> > >> Files:
> > >> include/llvm/Analysis/BlockFrequencyInfoImpl.h
> > >> include/llvm/Support/BranchProbability.h
> > >> lib/Support/BranchProbability.cpp
> > >> lib/Target/ARM/ARMBaseInstrInfo.cpp
> > >> test/Analysis/BlockFrequencyInfo/basic.ll
> > >> test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
> > >> test/Analysis/BranchProbabilityInfo/basic.ll
> > >> test/Analysis/BranchProbabilityInfo/loop.ll
> > >> test/Analysis/BranchProbabilityInfo/noreturn.ll
> > >> test/Analysis/BranchProbabilityInfo/pr18705.ll
> > >> test/Analysis/BranchProbabilityInfo/pr22718.ll
> > >> test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
> > >> test/CodeGen/ARM/2013-10-11-select-stalls.ll
> > >> test/CodeGen/ARM/ifcvt4.ll
> > >> test/CodeGen/ARM/sjlj-prepare-critical-edge.ll
> > >> test/CodeGen/ARM/tail-merge-branch-weight.ll
> > >> test/CodeGen/ARM/test-sharedidx.ll
> > >> test/CodeGen/Thumb2/thumb2-ifcvt1.ll
> > >> test/Transforms/SampleProfile/branch.ll
> > >> test/Transforms/SampleProfile/calls.ll
> > >> test/Transforms/SampleProfile/discriminator.ll
> > >> test/Transforms/SampleProfile/fnptr.ll
> > >> test/Transforms/SampleProfile/propagate.ll
> > >> unittests/Support/BlockFrequencyTest.cpp
> > >> unittests/Support/BranchProbabilityTest.cpp
> > >>
>
>