[llvm-bugs] [Bug 27219] New: Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug!

via llvm-bugs llvm-bugs at lists.llvm.org
Tue Apr 5 08:18:51 PDT 2016


            Bug ID: 27219
           Summary: Rewriting VMLA.F32 instructions as VMUL+VADD is not a
                    feature, it's a bug!
           Product: libraries
           Version: 3.8
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: ARM
          Assignee: unassignedbugs at nondot.org
          Reporter: jacob.benoit.1 at gmail.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

Created attachment 16172
  --> https://llvm.org/bugs/attachment.cgi?id=16172&action=edit

For some values of -mcpu, at least -mcpu=cortex-a8 and -mcpu=cortex-a7, LLVM
replaces VMLA.F32 instructions with a (VMUL, VADD) pair.

That much seems to be well-known.

Apparently, the idea is that on some older Cortex-A8 CPUs there was a
performance problem with VMLA, so replacing it with (VMUL, VADD) was a
work-around for that.

However, that is missing two facts:

Fact #1:

A (VMUL, VADD) pair needs a register to hold the temporary result of the VMUL.
In fully register-tight code making use of all NEON registers, that means
registers have to be spilled to memory to make room for the temporary.
Concretely, matrix multiplication (GEMM) kernels are an example of critical
code that uses all available NEON registers and consists mostly of VMLA. That's
how I stumbled upon this bug: Eigen (http://eigen.tuxfamily.org) was generating
inexplicably bad code, with massive register spillage, running 10x slower than
expected.
Eigen needs to know the number of available registers, and whether a
single-instruction multiply-accumulate (thus not requiring an intermediate
temporary register) is available.

The LLVM behavior of silently replacing VMLA by VMUL+VADD breaks what was
supposed to be an architecture invariant, and breaks Eigen's assumptions.

For now, Eigen works around this by reimplementing the vmlaq_f32 intrinsic in
inline assembly:
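A hedged sketch of such a workaround (the function name and the scalar fallback are illustrative, not Eigen's exact code): on NEON builds, inline assembly pins the operation to a single vmla.f32 so the compiler cannot split it into VMUL+VADD; elsewhere a plain-C fallback keeps the sketch compilable.

```c
/* Sketch of forcing a fused multiply-accumulate via inline asm.
   my_vmlaq_f32 is a hypothetical name standing in for a reimplemented
   vmlaq_f32; Eigen's actual workaround is along these lines. */
#if defined(__ARM_NEON)
#include <arm_neon.h>
static inline float32x4_t my_vmlaq_f32(float32x4_t acc, float32x4_t a,
                                       float32x4_t b) {
  /* "+w" keeps acc in a NEON register; %q selects the q-register name.
     The compiler sees one opaque instruction and cannot rewrite it. */
  asm("vmla.f32 %q0, %q1, %q2" : "+w"(acc) : "w"(a), "w"(b));
  return acc;
}
#else
/* Portable scalar fallback so the sketch compiles off-ARM. */
typedef struct { float v[4]; } f32x4;
static inline f32x4 my_vmlaq_f32(f32x4 acc, f32x4 a, f32x4 b) {
  for (int i = 0; i < 4; ++i) acc.v[i] += a.v[i] * b.v[i]; /* acc += a*b */
  return acc;
}
#endif
```

Because the asm statement is opaque to the instruction selector, no -mcpu setting can turn it back into a VMUL+VADD pair.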

One problem with the microbenchmarks discussed in the above llvm mailing list
thread is that they measured isolated VMLA instructions, without accounting for
the side effects on register pressure, which quickly become dominant in
real-world register-tight numerical code.
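To see why register pressure dominates, consider a GEMM-style inner loop that keeps many independent accumulators live at once (a hedged, simplified sketch, not Eigen's actual kernel): with a fused VMLA each accumulator updates in place, but a VMUL+VADD split needs one extra scratch register per in-flight multiply, which an isolated single-instruction microbenchmark never exposes.

```c
/* Illustrative register-tight kernel: 8 accumulators stay live across the
   loop, as in a GEMM microkernel. Each element does one multiply-accumulate;
   splitting it into multiply-then-add adds a temporary per accumulator and,
   once all vector registers are in use, forces spills to memory. */
enum { N = 64, ACCS = 8 };

float sum_with_many_accumulators(const float *a, const float *b) {
  float acc[ACCS] = {0};
  for (int i = 0; i < N; i += ACCS)
    for (int j = 0; j < ACCS; ++j)
      acc[j] += a[i + j] * b[i + j]; /* one multiply-accumulate per element */
  float total = 0;
  for (int j = 0; j < ACCS; ++j) /* final horizontal reduction */
    total += acc[j];
  return total;
}
```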

Fact #2: Most software compiled with this VMLA rewriting isn't actually
intended to run on a Cortex-A8 device specifically.

I'm getting the VMLA rewriting even without passing any -mcpu flag, probably
because -mcpu=cortex-a8 (or some such) is the default:

clang++ ~/vrac/vmlaq_f32_testcase.cc -S -o v.s -march=armv7-a
-mfloat-abi=softfp -mfpu=neon -O3

In this command line, I didn't say that I was interested in cortex-a8, so why
am I getting a cortex-a8 workaround that's detrimental on every other device,
and potentially catastrophic in register-tight code?

