[llvm-bugs] [Bug 39967] New: Clang generates slow 64-bit multiply code for NEON
via llvm-bugs
llvm-bugs at lists.llvm.org
Tue Dec 11 19:06:52 PST 2018
https://bugs.llvm.org/show_bug.cgi?id=39967
Bug ID: 39967
Summary: Clang generates slow 64-bit multiply code for NEON
Product: clang
Version: 7.0
Hardware: All
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: LLVM Codegen
Assignee: unassignedclangbugs at nondot.org
Reporter: husseydevin at gmail.com
CC: llvm-bugs at lists.llvm.org, neeilans at live.com,
richard-llvm at metafoo.co.uk
Created attachment 21214
--> https://bugs.llvm.org/attachment.cgi?id=21214&action=edit
Benchmark sample code
On 32-bit ARMv7-A, Clang produces slow code for U64x2 multiplication.
For example, the xxHash64 routine is much slower than it needs to be. I use a
simplified version of its main loop in the attached sample code.
The benchmark expects a Unix device: it reads from /dev/urandom so that the
output cannot be precalculated. Don't expect the results to match mine, but
they should all match each other.
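For reference, the xxHash64 accumulator round that the loop is based on looks
roughly like the following (the attached code simplifies it, so this is only a
sketch of the idea, not a copy of the attachment; the constant and helper names
follow the published xxHash64 algorithm):

#include <stdint.h>

#define PRIME64_1 11400714785074694791ULL  /* 0x9E3779B185EBCA87 */
#define PRIME64_2 14029467366897019727ULL  /* 0xC2B2AE3D27D4EB4F */

static uint64_t rotl64(uint64_t x, int r)
{
    return (x << r) | (x >> (64 - r));
}

/* One xxHash64 accumulator round; the 64x64->64 multiplies here are what
   become U64x2 multiplies once the loop is vectorized. */
static uint64_t xxh64_round(uint64_t acc, uint64_t input)
{
    acc += input * PRIME64_2;
    acc  = rotl64(acc, 31);
    acc *= PRIME64_1;
    return acc;
}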
For the first one (nonvec), I forced the target to ARMv7-R, which was the only
reliable way I found to disable vectorization.
The second one (autovec) is the same code as the first, but without the
vectorization restriction.
The third one (badmult) uses actual vectors with the built-in multiply
operator.
The last one (goodmult) is the same loop as badmult, but with the multiply
replaced by the optimized intrinsic routine below.
#include <arm_neon.h>

typedef uint32x2_t U32x2;   /* types implied by the intrinsics below */
typedef uint64x2_t U64x2;

U64x2 goodmult(U64x2 top, const U64x2 bot) {
    U32x2 topHi = vshrn_n_u64(top, 32);      // U32x2 topHi = top >> 32;
    U32x2 topLo = vmovn_u64(top);            // U32x2 topLo = top & 0xFFFFFFFF;
    U32x2 botHi = vshrn_n_u64(bot, 32);      // U32x2 botHi = bot >> 32;
    U32x2 botLo = vmovn_u64(bot);            // U32x2 botLo = bot & 0xFFFFFFFF;
    U64x2 prod1 = vmull_u32(topLo, botLo);   // U64x2 prod1 = (U64x2)topLo * botLo;
    U64x2 prod2 = vmull_u32(topHi, botLo);   // U64x2 prod2 = (U64x2)topHi * botLo;
    prod2 = vsraq_n_u64(prod2, prod1, 32);   // prod2 += (prod1 >> 32);
    prod2 = vmlal_u32(prod2, topLo, botHi);  // prod2 += (U64x2)topLo * botHi;
    return vsliq_n_u64(prod1, prod2, 32);    // return prod1 | (prod2 << 32);
}
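For comparison, badmult is essentially just the built-in operator on the vector
type (sketched here from the description above, using the same U64x2 typedef;
the attachment may differ in detail):

/* Sketch of the badmult variant: Clang accepts the '*' operator on NEON
   vector types, but ARMv7-A has no 64x64 vector multiply instruction, so
   this gets lowered to scalar umull/mla sequences plus moves between the
   NEON and core registers, which is what makes it slow. */
U64x2 badmult(U64x2 top, const U64x2 bot) {
    return top * bot;
}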
Note that Clang does essentially the same thing on SSE4.1.
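The equivalent hand-written sequence on x86 is the usual lo/hi split around
_mm_mul_epu32 (only SSE2 is actually needed for it); this is just a sketch of
the same trick, not code from the attachment:

#include <emmintrin.h>

/* Same lo*lo + ((hi*lo + lo*hi) << 32) decomposition as goodmult, with SSE2
   intrinsics. _mm_mul_epu32 multiplies the low 32 bits of each 64-bit lane
   and produces the full 64-bit products. */
__m128i goodmult_sse(__m128i top, __m128i bot) {
    __m128i topHi  = _mm_srli_epi64(top, 32);        // top >> 32
    __m128i botHi  = _mm_srli_epi64(bot, 32);        // bot >> 32
    __m128i prodLo = _mm_mul_epu32(top, bot);        // lo(top) * lo(bot)
    __m128i cross1 = _mm_mul_epu32(topHi, bot);      // hi(top) * lo(bot)
    __m128i cross2 = _mm_mul_epu32(top, botHi);      // lo(top) * hi(bot)
    __m128i cross  = _mm_add_epi64(cross1, cross2);  // sum of cross terms
    return _mm_add_epi64(prodLo, _mm_slli_epi64(cross, 32));
}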
These are the results of the attached sample on my LG G3 with Clang 7.0.0,
-march=armv7-a -O3, in Termux:
nonvec: 17.237543, result: { 0xd6d2116a54c1f11c, 0xdaeb008208bd6495 }
autovec: 26.295736, result: { 0xd6d2116a54c1f11c, 0xdaeb008208bd6495 }
badmult: 26.307957, result: { 0xd6d2116a54c1f11c, 0xdaeb008208bd6495 }
goodmult: 15.175430, result: { 0xd6d2116a54c1f11c, 0xdaeb008208bd6495 }
As you can see, the automatically vectorized code is significantly slower than
the non-vectorized code, and the hand-written intrinsic version (goodmult) is
the fastest.
You can define ITERS or DATA_SIZE (make sure it is a multiple of 16) to
whatever values you like.