[llvm-dev] Where's the optimiser gone (part 12): superfluous 64-bit integer divisions and multiplications

Fri Mar 15 07:52:40 PDT 2019

Pick <https://www.hackersdelight.org/hdcodetxt/divmnu64.c.txt>,
perform some MINOR modifications -- remove the code for the case
n==1, replace nlz() with __builtin_clz() -- then compile it for
a 32-bit machine: see <https://godbolt.org/z/BynK7S>

For a slight variation see <https://godbolt.org/z/DZdn9l>.

Deficiency #1 (lines 35 to 39)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    qhat = (un[j+n]*b + un[j+n-1])/vn[n-1];
    rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1];

clang does NOT recognize the explicit computation of the remainder
rhat directly following the quotient qhat and generates a call to
__udivdi3() instead of a call to __udivmoddi4() (which __udivdi3()
calls anyway!) which provides both quotient and remainder.

Even after defining the macro OPTIMIZE which produces

    qhat = (un[j+n]*b + un[j+n-1])/vn[n-1];
    rhat = (un[j+n]*b + un[j+n-1])%vn[n-1];

clang gets no clue!

JFTR: GCC and even MSVC recognize this sequence and generate calls
      to __udivmoddi4() and _aulldvrm() respectively!

Deficiency #2 (line 42)
~~~~~~~~~~~~~~~~~~~~~~~

    if (qhat >= b || qhat*vn[n-2] > b*rhat + un[j+n-2])

The second term is evaluated only if qhat is LESS than 2**32, so
a single 32x32-bit multiplication producing a 64-bit product is
sufficient here; clang but performs a 64x64-bit multiplication.

JFTR: with OPTIMIZE defined, clang (like MSVC; GCC but fails)
      generates a single MUL here.

Deficiency #3 (line 51)
~~~~~~~~~~~~~~~~~~~~~~~

The "loop" in lines 42 to 46 leaves qhat LESS than 2**32, so for

    p = qhat*vn[i];

again a single 32x32-bit multiplication producing a 64-bit product
is sufficient; clang but performs a 64x64-bit multiplication.

JFTR: with OPTIMIZE defined, clang (like MSVC and GCC) generates
      a single MUL here.

stay tuned
Stefan Kanthak