[llvm-dev] combineRepeatedFPDivisors design questions
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Mon Mar 6 11:34:30 PST 2017
Hi Vit,
Yes, please file a bug report. DAGCombiner::visitFDIV should call
BuildReciprocalEstimate, and then MachineLICM should hoist the
reciprocal-generation out of the loop. I imagine that the problem here
is really in X86TargetLowering::getRecipEstimate, which says this:
// It is likely not profitable to do this for f64 because a
double-precision
// reciprocal estimate with refinement on x86 prior to FMA requires
// 15 instructions: convert to single, rcpss, convert back to double,
refine
// (3 steps = 12 insts). If an 'rcpsd' variant was added to the ISA
// along with FMA, this could be a throughput win.
It seems like we need to be a bit smarter about this.
-Hal
On 03/06/2017 06:11 AM, vit9696 via llvm-dev wrote:
> Hello,
>
> Recently I have stumbled across an almost double performance decrease
> in one of my codebases when compiling with clang instead of gcc. I was
> testing with the currently latest 3.9.1 as well as 4.0.0 rc2 targeting
> x86-64 darwin with -Ofast, and after some investigation was able to
> narrow down the code to the following snippet:
>
> __attribute__ ((noinline)) bool calculate(double c) {
> uint32_t b = 0;
> uint32_t a = 0;
> while (b < 0xFFFFFFFF) {
> if ((uint64_t)(b / c) % 2 == 0)
> a++;
> b++;
> }
> return a;
> }
>
> The culprit here is the division, which is not hoisted out of the loop
> and replaced with a multiplication by a reciprocal. From what I
> understand the current implementation of combineRepeatedFPDivisors
> responsible for floating point division conversion does that given
> that the following preconditions are met:
> — unsafe math optimisations are enabled;
> — the amount of the divisions by the same divisor in a single basic
> block is above the threshold (2 for x86-64).
>
> According to the IR the divisions are in two separate blocks, and I
> guess that it must be the reason for the optimisation not taking
> place. This seems to be proven by using an even loop count
> like 0xFFFFFFF0 to avoid a middle check. In this case the optimisation
> works just fine. Here are the IR samples:
> https://ghostbin.com/paste/r4td2.
>
> At this point the thoughts arising on my side are:
> — Should this be considered a bug and reported to llvm-bugs or a
> desired behaviour?
> — If this is considered a bug, how is one supposed to fix it? With the
> exact sample provided the compiler could generally predict the amount
> of iterations being over the threshold and optimise the output,
> however, the real world examples (including mine) will unlikely be
> that simple. Loop body would be much larger, the conditions would be
> more complicated, the divisions would likely occur in several
> different basic blocks for various reasons, and I think in general it
> is not possible to predict the iteration count.
> — From what I can tell at least current gcc (6.3.0) and icc (17.0.2)
> are capable of doing that, with both this snippet and even when the
> loop count depends on an another argument, which means they are not
> capable to determine whether there would only be 0 iterations or many
> more, however, the optimisation would still be applied
> (https://ghostbin.com/paste/ttnrz). Perhaps llvm could also be more
> permissive when it sees a loop?
>
> Best regards,
> Vit
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170306/6d652794/attachment-0001.html>
More information about the llvm-dev
mailing list