[PATCH] D12603: Use fixed-point representation for BranchProbability

Xinliang David Li via llvm-commits llvm-commits at lists.llvm.org
Sat Sep 12 15:07:08 PDT 2015


A little correction: 0.88% is for the UINT32_MAX case, while 0.93% is for
the 0x80000000 case.

David

On Sat, Sep 12, 2015 at 3:02 PM, Xinliang David Li <davidxl at google.com>
wrote:

>
>
> On Fri, Sep 11, 2015 at 5:34 PM, Duncan P. N. Exon Smith <
> dexonsmith at apple.com> wrote:
>
>>
>> > On 2015-Sep-10, at 14:15, Xinliang David Li <davidxl at google.com> wrote:
>> >
>> > Cong, it is more common to use a power of 2 as the scaling factor in a
>> > fixed-point representation, for computation efficiency.  For a 32-bit
>> > width, 0x80000000 is common (known as Q31).  Here are the benefits:
>> >
>> > 1) It is much faster to do the scale operation with Q31 -- instead of
>> > using the current slow version of scale, we can call a faster version
>> > to be introduced.
>> >
>> > This will probably help resolve bug
>> > https://llvm.org/bugs/show_bug.cgi?id=24620 where 20% of the time is
>> > spent in the slow version of the scale method.
>>
>> Is there a way to speed up scaling with UINT32_MAX?  I suspect so, but
>> haven't worked it out.  (E.g., you could scale to UINT64_C(1) << 32 and
>> ignore the first dither; if there's no dithering, subtract 1 from the
>> last.)
>>
>
> The slowdown does not come from the scaling that forms the fixed-point
> value, but from:
>
> static uint64_t scale(uint64_t Num, uint32_t N, uint32_t D) {
>   ....
> }
>
> where the main contributors are the DIV/REM operations used in this method.
>
> My profiling data shows that if the DIV/REM operations are replaced with
> shift/bitwise-AND operations (when D is a power of 2), the runtime of the
> method can be sped up by 6X.
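>
> As a rough illustration (hypothetical code, not the proposed patch, and it
> truncates where the real scale() rounds), the power-of-2 fast path could
> look like this:
>
>   #include <cstdint>
>
>   // Compute Num * N / 2^Bits without 128-bit arithmetic, assuming
>   // Bits <= 31 and the exact result fits in 64 bits: DIV becomes a
>   // shift and REM becomes a bitwise AND.
>   static uint64_t scalePow2(uint64_t Num, uint32_t N, unsigned Bits) {
>     uint64_t Mask = (UINT64_C(1) << Bits) - 1;
>     uint64_t Hi = (Num >> Bits) * N;           // quotient part, no DIV
>     uint64_t Lo = ((Num & Mask) * N) >> Bits;  // remainder part, no REM
>     return Hi + Lo;
>   }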
>
> If D is UINT32_MAX and the constant value is actually exposed to the
> compiler (manually, or through const prop and cloning, which GCC does but
> Clang does not), the instruction selector will convert the DIV/REM into
> IMUL and SHR, which also gives a lot of speedup; but compared with the
> power-of-2 case, this version of 'scale' is still 1.6X slower.
>
> In other words, in terms of computation efficiency, a power-of-2
> denominator has a huge advantage, and the implementation of scale is also
> more readable.
>
>
>
>>
>> > 2) It allows negative fractional numbers to be represented.
>>
>> Not relevant for probabilities, though.
>>
>> > 3) It is more precise for lots of common cases. For instance, the
>> > representations of 0.5, 0.25, 0.75 (i.e. 1/2, 1/4, 3/4), etc. are
>> > actually lossless with Q31, while the current choice of UINT32_MAX
>> > cannot represent them precisely.
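>> >
>> > For example (illustrative arithmetic only):
>> >   0.75 * 0x80000000 = 1610612736      (an exact integer -- lossless)
>> >   0.75 * 0xffffffff = 3221225471.25   (must be rounded -- lossy)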
>>
>> It's less precise for lots of common cases, too.  2^32's only prime
>> factor is 2, but 2^32-1 has 3, 5, 17, etc.
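>>
>> (For example, 2^32-1 = 3 * 5 * 17 * 257 * 65537, so 1/3 is exactly
>> 1431655765 / 0xffffffff, whereas with a 0x80000000 denominator it has to
>> be rounded.)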
>>
>> This argument seems weak either way.
>>
>
> I have done some experiments to compare the precision of the two choices.
> Here is how the experiment is set up:
>
> 1) First collect a set of unique fractional numbers N/D, where D ranges
> from 2 to 100 and N from 1 to D-1. Skip N/D if GCD(N, D) is not 1.  The
> total number of ratios is 3403.
>
> 2)
>     for each value V in [100, 1000000]
>        for each (N, D) pair collected in 1)
>            2.1) compute the reference scaled value for V:
>                     scale_val_ref = scale(V, N, D)
>            2.2) compute the fixed-point rep for N/D:
>                     FixedPoint<SF> fp(N, D);
>            2.3) compute the scaled value for V:
>                     scale_val = scale(V, fp.N, SF)
>            2.4) compare scale_val and scale_val_ref and collect statistics
>
> In the above code, SF is either 0x80000000 or UINT32_MAX.
>
> In step 2) above, a total of ~3.4 billion scaled values are computed.
> The experiment collects, for each SF, the total number of scaling
> operations where scale_val != scale_val_ref. The max diff between
> scale_val and scale_val_ref is also recorded.
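>
> In (hypothetical) code, the comparison looks roughly like this, assuming
> the existing scale(Num, N, D) helper is available; this is a sketch, not
> the exact harness used:
>
>   #include <algorithm>
>   #include <cstdint>
>
>   static uint32_t gcd(uint32_t A, uint32_t B) {
>     while (B) { uint32_t T = A % B; A = B; B = T; }
>     return A;
>   }
>
>   // SF is either 0x80000000 or UINT32_MAX.
>   void runExperiment(uint32_t SF) {
>     uint64_t NumDiffs = 0, MaxDiff = 0;
>     for (uint32_t D = 2; D <= 100; ++D)
>       for (uint32_t N = 1; N < D; ++N) {
>         if (gcd(N, D) != 1)                        // step 1): reduced N/D only
>           continue;
>         uint32_t FpN = (uint32_t)scale(SF, N, D);  // step 2.2): fixed-point rep of N/D
>         for (uint64_t V = 100; V <= 1000000; ++V) {
>           uint64_t Ref = scale(V, N, D);           // step 2.1): reference value
>           uint64_t Val = scale(V, FpN, SF);        // step 2.3): via fixed point
>           if (Val != Ref) {                        // step 2.4): collect statistics
>             ++NumDiffs;
>             MaxDiff = std::max(MaxDiff, Val > Ref ? Val - Ref : Ref - Val);
>           }
>         }
>       }
>     // NumDiffs over the total op count gives the mismatch rates reported below.
>   }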
>
> Here are the results. In both cases, the max diff is 1.
> When SF = 0x80000000, about 0.88% of the scaling operations produce a
> result different from the reference value;
> when SF = UINT32_MAX, about 0.93% of the scaling results differ from the
> references.
>
> In short, the difference between the above two cases is 0.05%.
>
> Given the small precision difference, I tend to believe 0x80000000 is
> slightly better (given its efficiency and the simplicity of its div/rem
> handling).
>
> Regarding the debug printing, you mentioned that using 0xffffffff as the
> denominator is easier to track. Is there a reason for that? I find both
> large numbers equally hard to track mentally.
>
> thanks,
>
> David
>
>
>>
>> >
>> > (If needed in the future, we can also extend this to be a template
>> > class where the number of bits in the scaling factor is a template
>> > parameter, to support values > 1:
>> >   template <unsigned Bits> class FixedPoint {
>> >     // Use a 64-bit constant so Bits == 31 does not overflow.
>> >     static const uint64_t ScalingFactor = UINT64_C(1) << Bits;
>> >   };
>> > )
>>
>> Sure, applies to either.
>>
>> >
>> > In the IR dump, I find it easier to read either
>> > 1) a percentage, or
>> > 2) the M/N form when N is less than 10.
>>
>> I think we should always print (and check for) a percentage, but if you're
>> debugging something, you need to see exactly what's in memory, and the
>> hex is precise and easy to reason about.
>>
>> >
>> > The current dump in the 0x10ffff/0xffffffff form is hard to read -- can
>> > it be simplified to make the denominator < 10?
>> >
>> >
>> > Duncan, what is your opinion?
>>
>> If I'm going to be tracking down some bug by looking at DEBUG prints, I
>> suspect I'll find 0xffffffff easier to reason about than 0x80000000.  So
>> if we can keep UINT32_MAX, I'd prefer that.
>>
>> >
>> > thanks,
>> >
>> > David
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Sep 10, 2015 at 12:21 PM, Cong Hou <congh at google.com> wrote:
>> >> congh updated this revision to Diff 34474.
>> >> congh added a comment.
>> >>
>> >> Update the patch by fixing block frequency / branch probability unit
>> >> test failures.
>> >>
>> >> I am not sure if this is the right way to update the unit tests. The
>> >> fixed-point representation has precision issues compared to the previous
>> >> rational representation, which is why we need to update the expected
>> >> answers in the unit tests.
>> >>
>> >>
>> >> http://reviews.llvm.org/D12603
>> >>
>> >> Files:
>> >>  include/llvm/Analysis/BlockFrequencyInfoImpl.h
>> >>  include/llvm/Support/BranchProbability.h
>> >>  lib/Support/BranchProbability.cpp
>> >>  lib/Target/ARM/ARMBaseInstrInfo.cpp
>> >>  test/Analysis/BlockFrequencyInfo/basic.ll
>> >>  test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll
>> >>  test/Analysis/BranchProbabilityInfo/basic.ll
>> >>  test/Analysis/BranchProbabilityInfo/loop.ll
>> >>  test/Analysis/BranchProbabilityInfo/noreturn.ll
>> >>  test/Analysis/BranchProbabilityInfo/pr18705.ll
>> >>  test/Analysis/BranchProbabilityInfo/pr22718.ll
>> >>  test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
>> >>  test/CodeGen/ARM/2013-10-11-select-stalls.ll
>> >>  test/CodeGen/ARM/ifcvt4.ll
>> >>  test/CodeGen/ARM/sjlj-prepare-critical-edge.ll
>> >>  test/CodeGen/ARM/tail-merge-branch-weight.ll
>> >>  test/CodeGen/ARM/test-sharedidx.ll
>> >>  test/CodeGen/Thumb2/thumb2-ifcvt1.ll
>> >>  test/Transforms/SampleProfile/branch.ll
>> >>  test/Transforms/SampleProfile/calls.ll
>> >>  test/Transforms/SampleProfile/discriminator.ll
>> >>  test/Transforms/SampleProfile/fnptr.ll
>> >>  test/Transforms/SampleProfile/propagate.ll
>> >>  unittests/Support/BlockFrequencyTest.cpp
>> >>  unittests/Support/BranchProbabilityTest.cpp
>> >>
>>
>>
>