[llvm-bugs] [Bug 39967] New: Clang generates slow 64-bit multiply code for NEON
via llvm-bugs
llvm-bugs at lists.llvm.org
Tue Dec 11 19:06:52 PST 2018
https://bugs.llvm.org/show_bug.cgi?id=39967
Bug ID: 39967
Summary: Clang generates slow 64-bit multiply code for NEON
Product: clang
Version: 7.0
Hardware: All
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: LLVM Codegen
Assignee: unassignedclangbugs at nondot.org
Reporter: husseydevin at gmail.com
CC: llvm-bugs at lists.llvm.org, neeilans at live.com,
richard-llvm at metafoo.co.uk
Created attachment 21214
--> https://bugs.llvm.org/attachment.cgi?id=21214&action=edit
Benchmark sample code
On 32-bit ARMv7-A, Clang produces slow code for U64x2 multiplication.
For example, the xxHash64 routine is much slower than it needs to be. I use a
simplified version of its main loop in the attached sample code.
The benchmark expects a Unix device: it reads from /dev/urandom so that the
output cannot be precalculated. Don't expect the results to match mine, but
they should all match each other.
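For reference, the xxHash64 accumulator round that the loop is based on looks
roughly like the following (the attached code simplifies it, so this is only a
sketch of the idea, not a copy of the attachment; the constant and helper names
follow the published xxHash64 algorithm):

#include <stdint.h>

#define PRIME64_1 11400714785074694791ULL  /* 0x9E3779B185EBCA87 */
#define PRIME64_2 14029467366897019727ULL  /* 0xC2B2AE3D27D4EB4F */

static uint64_t rotl64(uint64_t x, int r)
{
    return (x << r) | (x >> (64 - r));
}

/* One xxHash64 accumulator round; the 64x64->64 multiplies here are what
   become U64x2 multiplies once the loop is vectorized. */
static uint64_t xxh64_round(uint64_t acc, uint64_t input)
{
    acc += input * PRIME64_2;
    acc  = rotl64(acc, 31);
    acc *= PRIME64_1;
    return acc;
}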
For the first one (nonvec), I forced the target to ARMv7-R, which was the only
reliable way I found to disable vectorization.
The second one (autovec) is the same code as the first, but without the
vectorization restriction.
The third one (badmult) uses actual vectors with the built-in multiply
operator.
The last one (goodmult) is the same loop as badmult, but with the multiply
replaced by the optimized intrinsic routine below.
#include <arm_neon.h>

typedef uint32x2_t U32x2;   /* types implied by the intrinsics below */
typedef uint64x2_t U64x2;

U64x2 goodmult(U64x2 top, const U64x2 bot) {
    U32x2 topHi = vshrn_n_u64(top, 32);      // U32x2 topHi = top >> 32;
    U32x2 topLo = vmovn_u64(top);            // U32x2 topLo = top & 0xFFFFFFFF;
    U32x2 botHi = vshrn_n_u64(bot, 32);      // U32x2 botHi = bot >> 32;
    U32x2 botLo = vmovn_u64(bot);            // U32x2 botLo = bot & 0xFFFFFFFF;
    U64x2 prod1 = vmull_u32(topLo, botLo);   // U64x2 prod1 = (U64x2)topLo * botLo;
    U64x2 prod2 = vmull_u32(topHi, botLo);   // U64x2 prod2 = (U64x2)topHi * botLo;
    prod2 = vsraq_n_u64(prod2, prod1, 32);   // prod2 += (prod1 >> 32);
    prod2 = vmlal_u32(prod2, topLo, botHi);  // prod2 += (U64x2)topLo * botHi;
    return vsliq_n_u64(prod1, prod2, 32);    // return prod1 | (prod2 << 32);
}
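For comparison, badmult is essentially just the built-in operator on the vector
type (sketched here from the description above, using the same U64x2 typedef;
the attachment may differ in detail):

/* Sketch of the badmult variant: Clang accepts the '*' operator on NEON
   vector types, but ARMv7-A has no 64x64 vector multiply instruction, so
   this gets lowered to scalar umull/mla sequences plus moves between the
   NEON and core registers, which is what makes it slow. */
U64x2 badmult(U64x2 top, const U64x2 bot) {
    return top * bot;
}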
Note that Clang does essentially the same thing on SSE4.1.
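The equivalent hand-written sequence on x86 is the usual lo/hi split around
_mm_mul_epu32 (only SSE2 is actually needed for it); this is just a sketch of
the same trick, not code from the attachment:

#include <emmintrin.h>

/* Same lo*lo + ((hi*lo + lo*hi) << 32) decomposition as goodmult, with SSE2
   intrinsics. _mm_mul_epu32 multiplies the low 32 bits of each 64-bit lane
   and produces the full 64-bit products. */
__m128i goodmult_sse(__m128i top, __m128i bot) {
    __m128i topHi  = _mm_srli_epi64(top, 32);        // top >> 32
    __m128i botHi  = _mm_srli_epi64(bot, 32);        // bot >> 32
    __m128i prodLo = _mm_mul_epu32(top, bot);        // lo(top) * lo(bot)
    __m128i cross1 = _mm_mul_epu32(topHi, bot);      // hi(top) * lo(bot)
    __m128i cross2 = _mm_mul_epu32(top, botHi);      // lo(top) * hi(bot)
    __m128i cross  = _mm_add_epi64(cross1, cross2);  // sum of cross terms
    return _mm_add_epi64(prodLo, _mm_slli_epi64(cross, 32));
}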
These are the results of the attached sample on my LG G3 with Clang 7.0.0,
-march=armv7-a -O3, in Termux:
nonvec: 17.237543, result: { 0xd6d2116a54c1f11c, 0xdaeb008208bd6495 }
autovec: 26.295736, result: { 0xd6d2116a54c1f11c, 0xdaeb008208bd6495 }
badmult: 26.307957, result: { 0xd6d2116a54c1f11c, 0xdaeb008208bd6495 }
goodmult: 15.175430, result: { 0xd6d2116a54c1f11c, 0xdaeb008208bd6495 }
As you can see, the automatically vectorized code is significantly slower than
the non-vectorized code, and the hand-written intrinsic version (goodmult) is
the fastest.
You can define ITERS or DATA_SIZE (make sure it is a multiple of 16) to
whatever values you like.