[PATCH] D12603: Use fixed-point representation for BranchProbability

Duncan P. N. Exon Smith via llvm-commits llvm-commits at lists.llvm.org
Thu Sep 24 17:06:47 PDT 2015


> On 2015-Sep-21, at 14:16, Xinliang David Li <davidxl at google.com> wrote:
> 
> 
> 
> On Mon, Sep 21, 2015 at 2:04 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
> 
> > On 2015-Sep-12, at 15:02, Xinliang David Li <davidxl at google.com> wrote:
> >
> >
> >
> > On Fri, Sep 11, 2015 at 5:34 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
> >
> > > On 2015-Sep-10, at 14:15, Xinliang David Li <davidxl at google.com> wrote:
> > >
> > > Cong, it is more common to use a power of 2 as the scaling factor in
> > > a fixed-point representation, for computation efficiency.  For 32-bit
> > > width, 0x80000000 is common (known as Q31).  Here are the benefits:
> > >
> > > 1) it is much faster to do the scale operation with Q31 -- instead of
> > > using the current slow version of scale, it can call a faster version
> > > to be introduced.
> > >
> > > This will probably help resolve bug
> > > https://llvm.org/bugs/show_bug.cgi?id=24620 where 20% of the time is
> > > spent in the slow version of the scale method.
> >
> > Is there a way to speed up scaling with UINT32_MAX?  I suspect so, but
> > haven't worked it out.  (E.g., you could scale to UINT64_C(1) << 32 and
> > ignore the first dither; if there's no dithering, subtract 1 from the
> > last.)
> >
> > The slowdown does not come from the scaling that forms the fixed-point value, but from:
> >
> > // Scales Num by the fraction N/D.
> > static uint64_t scale(uint64_t Num, uint32_t N, uint32_t D) {
> >   ....
> > }
> >
> > where the main contributors are the DIV/REM operations used in this method.
> >
> > My profiling data shows that if the DIV/REM operations are replaced with shift/bitwise-AND operations (when D is a power of 2), the runtime of the method can be sped up by 6X.
> >
> > If D is UINT32_MAX and the constant value is actually exposed to the compiler (manually, or through const prop and cloning, which GCC does but Clang does not), the instruction selector will convert the DIV/REM into IMUL and SHR, which also gives a large speedup; but compared with the power-of-2 case, this version of 'scale' is still 1.6X slower.
> >
> > In other words, in terms of computation efficiency a power-of-2 denominator has a huge advantage, and the implementation of scale is also more readable.
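As a rough illustration of the shift-based fast path described above (a hypothetical sketch, not the actual scale() implementation), with D fixed at 1u << 31 the DIV/REM disappears entirely:

  #include <cstdint>

  // Hypothetical fast path: scale Num by N / 2^31 using only multiplies and
  // shifts.  Assumes N <= (1u << 31), i.e. the probability is at most 1, so
  // the result Num * N / 2^31 still fits in 64 bits.
  static uint64_t scalePow2(uint64_t Num, uint32_t N) {
    uint64_t Lo = (Num & UINT64_C(0xffffffff)) * N; // low 32 bits of Num times N
    uint64_t Hi = (Num >> 32) * N;                  // high 32 bits of Num times N
    // (Hi * 2^32 + Lo) / 2^31 == Hi * 2 + Lo / 2^31, computed with shifts only.
    return (Hi << 1) + (Lo >> 31);
  }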
> >
> >
> >
> > > 2) it allows negative fractional number to be represented
> >
> > Not relevant for probabilities, though.
> >
> > > 3) it is more precise for lots of common cases. For instance, the
> > > representations of 0.5, 0.25, 0.75 (i.e. 1/2, 1/4, 3/4), etc. are
> > > actually lossless with Q31, while the current choice of using
> > > uint32_max cannot represent them precisely.
> >
> > It's less precise for lots of common cases, too.  2^32's only prime
> > factor is 2, but 2^32-1 has 3, 5, 17, etc.
> >
> > This argument seems weak either way.
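To make the trade-off concrete (a hypothetical helper, just for illustration): 1/2 is exact with a 2^31 scaling factor but not with UINT32_MAX, while 1/3 is exact with UINT32_MAX (3 divides 2^32 - 1) but not with 2^31:

  #include <cinttypes>
  #include <cstdint>
  #include <cstdio>

  // Round N/D to the nearest multiple of 1/SF and return the fixed-point numerator.
  static uint32_t toFixedPoint(uint32_t N, uint32_t D, uint64_t SF) {
    return (uint32_t)((N * SF + D / 2) / D);
  }

  int main() {
    uint64_t Q31 = UINT64_C(1) << 31;
    // 1/2: 0x40000000 / 2^31 is exact; 0x80000000 / 0xffffffff is not.
    printf("1/2: 0x%08" PRIx32 " (Q31)  0x%08" PRIx32 " (UINT32_MAX)\n",
           toFixedPoint(1, 2, Q31), toFixedPoint(1, 2, UINT32_MAX));
    // 1/3: 0x55555555 / 0xffffffff is exactly 1/3 (2^32-1 = 3*5*17*257*65537);
    // no multiple of 1/2^31 equals 1/3.
    printf("1/3: 0x%08" PRIx32 " (Q31)  0x%08" PRIx32 " (UINT32_MAX)\n",
           toFixedPoint(1, 3, Q31), toFixedPoint(1, 3, UINT32_MAX));
    return 0;
  }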
> >
> > I have done some experiments to compare the precision of the two choices. Here is how the experiment is set up:
> >
> > 1) First collect a set of unique fractional numbers N/D, where D increases from 2 to 100 and N increases from 1 to D-1. Skip N/D if GCD(D, N) is not 1. The total number of ratios is 3403.
> >
> > 2)
> >      for each value V in the set [100, 1000000]
> >         for each (N,D) pair collected in 1)
> >              2.1) compute the reference scaled value for V:
> >                   scale_val_ref = scale(V, N, D)
> >              2.2) compute the fixed-point rep for N/D:
> >                   FixedPoint<SF> fp(N, D);
> >              2.3) compute the scaled value for V:
> >                   scale_val = scale(V, fp.N, SF)
> >
> >              2.4) compare scale_val and scale_val_ref and collect statistics
> >
> > In the above code, SF is 0x80000000 in one run and UINT32_MAX in the other.
> >
> > In step 2.2) above, a total of ~3.4 billion scaled values are computed. The experiment collects, for each SF, the total number of scaling operations where scale_val != scale_val_ref. The max diff between scale_val and scale_val_ref is also recorded.
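A rough sketch of how that comparison loop could be implemented (hypothetical code; scaleRef, the truncating rounding, and the way the fixed-point numerator is formed are assumptions, not LLVM APIs):

  #include <cstdint>
  #include <cstdio>
  #include <numeric>   // std::gcd (C++17)

  // Reference scaling, V * N / D; truncation here is an assumption about the
  // rounding the original experiment used.
  static uint64_t scaleRef(uint64_t V, uint64_t N, uint64_t D) {
    return V * N / D;
  }

  int main() {
    const uint64_t SF = UINT64_C(1) << 31; // rerun with UINT32_MAX for the other case
    uint64_t Total = 0, Mismatch = 0, MaxDiff = 0;
    for (uint64_t V = 100; V <= 1000000; ++V)
      for (uint32_t D = 2; D <= 100; ++D)
        for (uint32_t N = 1; N < D; ++N) {
          if (std::gcd(N, D) != 1)
            continue;
          uint64_t Ref = scaleRef(V, N, D);       // 2.1) reference value
          uint64_t FP = (N * SF + D / 2) / D;     // 2.2) fixed-point numerator
          uint64_t Val = scaleRef(V, FP, SF);     // 2.3) scaled via fixed point
          ++Total;                                // 2.4) collect statistics
          if (Val != Ref) {
            ++Mismatch;
            uint64_t Diff = Val > Ref ? Val - Ref : Ref - Val;
            MaxDiff = Diff > MaxDiff ? Diff : MaxDiff;
          }
        }
    printf("mismatches: %llu of %llu, max diff: %llu\n",
           (unsigned long long)Mismatch, (unsigned long long)Total,
           (unsigned long long)MaxDiff);
    return 0;
  }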
> >
> > Here is the result. In both cases, the max diff is 1.
> > When SF = 0x80000000, about 0.88% of the total scaling operations produce a result that differs from the reference value;
> > when SF = UINT32_MAX, about 0.93% of the total scaling results differ from the reference.
> >
> > In short, the difference between the above two cases is 0.05%
> 
> This looks like 7% exact vs. 12% exact, which is almost double for this
> data set.  Fairly big difference.
> 
> 99.07% vs 99.12% exact, right?

Ah, I misread your numbers.  I thought they were suspiciously bad ;).

>  
> 
> Although, the idea that adding almost an entire extra bit could make
> precision *worse* is kind of strange.  I'm a little skeptical that this
> calculation is useful.
> 
> > Given the small precision difference, I tend to believe 0x80000000 is slightly better (given its efficiency and simplicity for div/rem ops).
> >
> > Regarding the debugging print, you mentioned that using 0xffffffff as the denominator is easier to track. Is there a reason for that? I find large numbers equally hard to track mentally.
> 
> I find that there's less "carrying" when adding numbers together.
> Anecdotally, when adding a bunch of numbers together, it's "having to
> carry digits" that most increases the likelihood I'll have to take
> notes.
> 
> In the end this probably doesn't matter too much.  I'm fine either way.
> 
> thanks!  Just want to be clear, you are fine with (1<<31) as the scaling factor, right?
> 

Yes, I'm fine with this (although I still prefer the other way).

> >
> > thanks,
> >
> > David
> >
> >
> > >
> > > (if needed in the future, we can also extend this to be a template
> > > class where the number of bits in the scaling factor is a template
> > > parameter, to support values > 1:
> > >   template <int Bits> class FixedPoint {
> > >     static const uint32_t ScalingFactor = 1u << Bits;
> > >   };
> > > )
> >
> > Sure, applies to either.
> >
> > >
> > > In the IR dump, I find it easier to read
> > > 1) percentage,
> > > 2) M/N form when N is less than 10.
> >
> > I think we should always print (and check) for a percentage, but if you're
> > debugging something, you need to see exactly what's in memory, and the
> > hex is precise and easy to reason about.
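For example, a debug print that shows both (a hypothetical helper, not the actual BranchProbability printing code) could emit the raw hex numerator/denominator next to a rounded percentage:

  #include <cstdint>
  #include <cstdio>

  // Hypothetical: show the exact in-memory fixed-point value in hex alongside
  // a human-readable percentage.
  static void printProb(uint32_t N, uint32_t D) {
    printf("0x%08x / 0x%08x = %.2f%%\n", (unsigned)N, (unsigned)D,
           100.0 * (double)N / (double)D);
  }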
> >
> > >
> > > The current dump in 0x10ffff/0xffffffff form is hard to read -- can it
> > > be simplified to make the denominator < 10?
> > >
> > >
> > > Duncan, what is your opinion?
> >
> > If I'm going to be tracking down some bug by looking at DEBUG prints, I
> > suspect I'll find 0xffffffff easier to reason about than 0x80000000.  So
> > if we can keep UINT32_MAX, I'd prefer that.
> >
> > >
> > > thanks,
> > >
> > > David
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Sep 10, 2015 at 12:21 PM, Cong Hou <congh at google.com> wrote:
> > >> congh updated this revision to Diff 34474.
> > >> congh added a comment.
> > >>
> > >> Update the patch by fixing block frequency / branch probability unit test failures.
> > >>
> > >> I am not sure if this is the right way to update the unit tests. The fixed-point representation has precision issues compared to the previous rational representation, and that is why we need to update the expected answers in the unit tests.
> > >>
> > >>
> > >> http://reviews.llvm.org/D12603
> > >>
> > >> Files:
> > >>  include/llvm/Analysis/BlockFrequencyInfoImpl.h
> > >>  include/llvm/Support/BranchProbability.h
> > >>  lib/Support/BranchProbability.cpp
> > >>  lib/Target/ARM/ARMBaseInstrInfo.cpp
> > >>  test/Analysis/BlockFrequencyInfo/basic.ll
> > >>  test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
> > >>  test/Analysis/BranchProbabilityInfo/basic.ll
> > >>  test/Analysis/BranchProbabilityInfo/loop.ll
> > >>  test/Analysis/BranchProbabilityInfo/noreturn.ll
> > >>  test/Analysis/BranchProbabilityInfo/pr18705.ll
> > >>  test/Analysis/BranchProbabilityInfo/pr22718.ll
> > >>  test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
> > >>  test/CodeGen/ARM/2013-10-11-select-stalls.ll
> > >>  test/CodeGen/ARM/ifcvt4.ll
> > >>  test/CodeGen/ARM/sjlj-prepare-critical-edge.ll
> > >>  test/CodeGen/ARM/tail-merge-branch-weight.ll
> > >>  test/CodeGen/ARM/test-sharedidx.ll
> > >>  test/CodeGen/Thumb2/thumb2-ifcvt1.ll
> > >>  test/Transforms/SampleProfile/branch.ll
> > >>  test/Transforms/SampleProfile/calls.ll
> > >>  test/Transforms/SampleProfile/discriminator.ll
> > >>  test/Transforms/SampleProfile/fnptr.ll
> > >>  test/Transforms/SampleProfile/propagate.ll
> > >>  unittests/Support/BlockFrequencyTest.cpp
> > >>  unittests/Support/BranchProbabilityTest.cpp
> > >>


