[llvm-dev] combineRepeatedFPDivisors design questions

Mon Mar 6 11:34:30 PST 2017

Hi Vit,

Yes, please file a bug report. DAGCombiner::visitFDIV should call 
BuildReciprocalEstimate, and then MachineLICM should hoist the 
reciprocal-generation out of the loop. I imagine that the problem here 
is really in X86TargetLowering::getRecipEstimate, which says this:

   // It is likely not profitable to do this for f64 because a 
double-precision
   // reciprocal estimate with refinement on x86 prior to FMA requires
   // 15 instructions: convert to single, rcpss, convert back to double, 
refine
   // (3 steps = 12 insts). If an 'rcpsd' variant was added to the ISA
   // along with FMA, this could be a throughput win.

It seems like we need to be a bit smarter about this.

  -Hal

On 03/06/2017 06:11 AM, vit9696 via llvm-dev wrote:
> Hello,
>
> Recently I have stumbled across an almost double performance decrease 
> in one of my codebases when compiling with clang instead of gcc. I was 
> testing with the currently latest 3.9.1 as well as 4.0.0 rc2 targeting 
> x86-64 darwin with -Ofast, and after some investigation was able to 
> narrow down the code to the following snippet:
>
> __attribute__ ((noinline)) bool calculate(double c) {
> uint32_t b = 0;
> uint32_t a = 0;
> while (b < 0xFFFFFFFF) {
> if ((uint64_t)(b / c) % 2 == 0)
> a++;
> b++;
> }
> return a;
> }
>
> The culprit here is the division, which is not hoisted out of the loop 
> and replaced with a multiplication by a reciprocal. From what I 
> understand the current implementation of combineRepeatedFPDivisors 
> responsible for floating point division conversion does that given 
> that the following preconditions are met:
> — unsafe math optimisations are enabled;
> — the amount of the divisions by the same divisor in a single basic 
> block is above the threshold (2 for x86-64).
>
> According to the IR the divisions are in two separate blocks, and I 
> guess that it must be the reason for the optimisation not taking 
> place. This seems to be proven by using an even loop count 
> like 0xFFFFFFF0 to avoid a middle check. In this case the optimisation 
> works just fine. Here are the IR samples: 
> https://ghostbin.com/paste/r4td2.
>
> At this point the thoughts arising on my side are:
> — Should this be considered a bug and reported to llvm-bugs or a 
> desired behaviour?
> — If this is considered a bug, how is one supposed to fix it? With the 
> exact sample provided the compiler could generally predict the amount 
> of iterations being over the threshold and optimise the output, 
> however, the real world examples (including mine) will unlikely be 
> that simple. Loop body would be much larger, the conditions would be 
> more complicated, the divisions would likely occur in several 
> different basic blocks for various reasons, and I think in general it 
> is not possible to predict the iteration count.
> — From what I can tell at least current gcc (6.3.0) and icc (17.0.2) 
> are capable of doing that, with both this snippet and even when the 
> loop count depends on an another argument, which means they are not 
> capable to determine whether there would only be 0 iterations or many 
> more, however, the optimisation would still be applied 
> (https://ghostbin.com/paste/ttnrz). Perhaps llvm could also be more 
> permissive when it sees a loop?
>
> Best regards,
> Vit
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170306/6d652794/attachment-0001.html>