<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">Thank you for your participation. I have just created a new entry in bugzilla:</div><div class=""><a href="https://bugs.llvm.org/show_bug.cgi?id=32157" class="">https://bugs.llvm.org/show_bug.cgi?id=32157</a></div><div class=""><br class=""></div><div class="">Vit</div><div class=""><br class=""></div><div><blockquote type="cite" class=""><div class="">6 марта 2017 г., в 22:34, Hal Finkel <<a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a>> написал(а):</div><br class="Apple-interchange-newline"><div class="">

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type" class="">

  <div bgcolor="#FFFFFF" text="#000000" class=""><p class="">Hi Vit,</p><p class="">Yes, please file a bug report. DAGCombiner::visitFDIV should call

      BuildReciprocalEstimate, and then MachineLICM should hoist the

      reciprocal-generation out of the loop. I imagine that the problem

      here is really in X86TargetLowering::getRecipEstimate, which says

      this:</p><p class="">  // It is likely not profitable to do this for f64 because a

      double-precision<br class="">

        // reciprocal estimate with refinement on x86 prior to FMA

      requires<br class="">

        // 15 instructions: convert to single, rcpss, convert back to

      double, refine<br class="">

        // (3 steps = 12 insts). If an 'rcpsd' variant was added to the

      ISA<br class="">

        // along with FMA, this could be a throughput win.<br class="">

    </p><p class="">It seems like we need to be a bit smarter about this.</p><p class=""> -Hal<br class="">

    </p>

    <div class="moz-cite-prefix">On 03/06/2017 06:11 AM, vit9696 via

      llvm-dev wrote:<br class="">

    </div>

    <blockquote cite="mid:4A869D35-C388-465A-9466-8DE3A7D5F9C7@avp.su" type="cite" class="">

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" class="">

      Hello,

      <div class=""><br class="">

      </div>

      <div class="">Recently I have stumbled across an almost double

        performance decrease in one of my codebases when compiling with

        clang instead of gcc. I was testing with the currently latest

        3.9.1 as well as 4.0.0 rc2 targeting x86-64 darwin with -Ofast,

        and after some investigation was able to narrow down the code to

        the following snippet:</div>

      <div class=""><br class="">

      </div>

      <div class="">

        <div class="">__attribute__ ((noinline)) bool calculate(double

          c) {</div>

        <div class=""><span class="Apple-tab-span" style="white-space:pre">   </span>uint32_t

          b = 0;</div>

        <div class=""><span class="Apple-tab-span" style="white-space:pre">   </span>uint32_t

          a = 0;</div>

        <div class=""><span class="Apple-tab-span" style="white-space:pre">   </span>while

          (b < 0xFFFFFFFF) {</div>

        <div class=""><span class="Apple-tab-span" style="white-space: pre;">         </span>if

          ((uint64_t)(b / c) % 2 == 0)</div>

        <div class=""><span class="Apple-tab-span" style="white-space:pre">                   </span>a++;</div>

        <div class=""><span class="Apple-tab-span" style="white-space:pre">           </span>b++;</div>

        <div class=""><span class="Apple-tab-span" style="white-space:pre">   </span>}</div>

        <div class=""><span class="Apple-tab-span" style="white-space:pre">   </span>return

          a;</div>

        <div class="">}</div>

      </div>

      <div class=""><br class="">

      </div>

      <div class="">The culprit here is the division, which is not

        hoisted out of the loop and replaced with a multiplication by

        a reciprocal. From what I understand the current implementation

        of combineRepeatedFPDivisors responsible for floating point

        division conversion does that given that the following

        preconditions are met:</div>

      <div class="">— unsafe math optimisations are enabled;</div>

      <div class="">— the amount of the divisions by the same divisor in

        a single basic block is above the threshold (2 for x86-64).</div>

      <div class=""><br class="">

      </div>

      <div class="">According to the IR the divisions are in two

        separate blocks, and I guess that it must be the reason for the

        optimisation not taking place. This seems to be proven by using

        an even loop count like 0xFFFFFFF0 to avoid a middle check. In

        this case the optimisation works just fine. Here are the IR

        samples: <a moz-do-not-send="true" href="https://ghostbin.com/paste/r4td2" class="">https://ghostbin.com/paste/r4td2</a>.</div>

      <div class=""><br class="">

      </div>

      <div class="">At this point the thoughts arising on my side are:</div>

      <div class="">— Should this be considered a bug and reported to

        llvm-bugs or a desired behaviour?</div>

      <div class="">— If this is considered a bug, how is one supposed

        to fix it? With the exact sample provided the compiler could

        generally predict the amount of iterations being over the

        threshold and optimise the output, however, the real world

        examples (including mine) will unlikely be that simple. Loop

        body would be much larger, the conditions would be more

        complicated, the divisions would likely occur in several

        different basic blocks for various reasons, and I think in

        general it is not possible to predict the iteration count.</div>

      <div class="">

        <div class="">— From what I can tell at least current gcc

          (6.3.0) and icc (17.0.2) are capable of doing that, with both

          this snippet and even when the loop count depends on an

          another argument, which means they are not capable to

          determine whether there would only be 0 iterations or many

          more, however, the optimisation would still be applied (<a moz-do-not-send="true" href="https://ghostbin.com/paste/ttnrz" class="">https://ghostbin.com/paste/ttnrz</a>).

          Perhaps llvm could also be more permissive when it sees a

          loop?</div>

      </div>

      <div class=""><br class="">

      </div>

      <div class="">Best regards,</div>

      <div class="">Vit</div>

      <div class=""><br class="">

      </div>

      <br class="">

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br class="">

      <pre wrap="" class="">_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

    </blockquote>

    <br class="">

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </div>

</div></blockquote></div><br class=""></body></html>