[llvm-bugs] [Bug 27219] New: Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug!

via llvm-bugs llvm-bugs at lists.llvm.org
Tue Apr 5 08:18:51 PDT 2016


            Bug ID: 27219
           Summary: Rewriting VMLA.F32 instructions as VMUL+VADD is not a
                    feature, it's a bug!
           Product: libraries
           Version: 3.8
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: ARM
          Assignee: unassignedbugs at nondot.org
          Reporter: jacob.benoit.1 at gmail.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

Created attachment 16172
  --> https://llvm.org/bugs/attachment.cgi?id=16172&action=edit

For some values of -mcpu, at least -mcpu=cortex-a8 and -mcpu=cortex-a7, LLVM
replaces VMLA.F32 instructions with a (VMUL, VADD) pair.

That much seems to be well-known.

Apparently, the idea is that on some older Cortex-A8 CPUs there was a
performance problem with VMLA, so replacing it with (VMUL, VADD) was a
work-around for that.

However, that is missing two facts:

Fact #1:

A (VMUL, VADD) pair needs a register to hold the temporary result of the VMUL.
In fully register-tight code making use of all NEON registers, that means
registers have to be spilled to memory to make room for the temporary.
Concretely, matrix multiplication (GEMM) kernels are an example of critical
code that uses all available NEON registers and consists mostly of VMLA. That's
how I stumbled upon this bug: Eigen (http://eigen.tuxfamily.org) was generating
inexplicably bad code, with massive register spillage, running 10x slower than
expected.
Eigen needs to know the number of available registers, and whether a
single-instruction multiply-accumulate (thus not requiring an intermediate
temporary register) is available.

The LLVM behavior of silently replacing VMLA by VMUL+VADD breaks what was
supposed to be an architecture invariant, and breaks Eigen's assumptions.

For now, Eigen works around this by reimplementing the vmlaq_f32 intrinsic in
inline assembly:
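A hedged sketch of such a workaround (the function name and the scalar fallback are illustrative, not Eigen's exact code): on NEON builds, inline assembly pins the operation to a single vmla.f32 so the compiler cannot split it into VMUL+VADD; elsewhere a plain-C fallback keeps the sketch compilable.

```c
/* Sketch of forcing a fused multiply-accumulate via inline asm.
   my_vmlaq_f32 is a hypothetical name standing in for a reimplemented
   vmlaq_f32; Eigen's actual workaround is along these lines. */
#if defined(__ARM_NEON)
#include <arm_neon.h>
static inline float32x4_t my_vmlaq_f32(float32x4_t acc, float32x4_t a,
                                       float32x4_t b) {
  /* "+w" keeps acc in a NEON register; %q selects the q-register name.
     The compiler sees one opaque instruction and cannot rewrite it. */
  asm("vmla.f32 %q0, %q1, %q2" : "+w"(acc) : "w"(a), "w"(b));
  return acc;
}
#else
/* Portable scalar fallback so the sketch compiles off-ARM. */
typedef struct { float v[4]; } f32x4;
static inline f32x4 my_vmlaq_f32(f32x4 acc, f32x4 a, f32x4 b) {
  for (int i = 0; i < 4; ++i) acc.v[i] += a.v[i] * b.v[i]; /* acc += a*b */
  return acc;
}
#endif
```

Because the asm statement is opaque to the instruction selector, no -mcpu setting can turn it back into a VMUL+VADD pair.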

One problem with the microbenchmarks discussed in the above llvm mailing list
thread is that they measured isolated VMLA instructions, without accounting for
the side effects on register pressure, which quickly become dominant in
real-world register-tight numerical code.
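To see why register pressure dominates, consider a GEMM-style inner loop that keeps many independent accumulators live at once (a hedged, simplified sketch, not Eigen's actual kernel): with a fused VMLA each accumulator updates in place, but a VMUL+VADD split needs one extra scratch register per in-flight multiply, which an isolated single-instruction microbenchmark never exposes.

```c
/* Illustrative register-tight kernel: 8 accumulators stay live across the
   loop, as in a GEMM microkernel. Each element does one multiply-accumulate;
   splitting it into multiply-then-add adds a temporary per accumulator and,
   once all vector registers are in use, forces spills to memory. */
enum { N = 64, ACCS = 8 };

float sum_with_many_accumulators(const float *a, const float *b) {
  float acc[ACCS] = {0};
  for (int i = 0; i < N; i += ACCS)
    for (int j = 0; j < ACCS; ++j)
      acc[j] += a[i + j] * b[i + j]; /* one multiply-accumulate per element */
  float total = 0;
  for (int j = 0; j < ACCS; ++j) /* final horizontal reduction */
    total += acc[j];
  return total;
}
```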

Fact #2: Most software compiled with this VMLA rewriting isn't actually
intended to run on a Cortex-A8 device specifically.

I'm getting the VMLA rewriting even without passing any -mcpu flag, probably
because -mcpu=cortex-a8 (or some such) is the default:

clang++ ~/vrac/vmlaq_f32_testcase.cc -S -o v.s -march=armv7-a
-mfloat-abi=softfp -mfpu=neon -O3

In this command line, I didn't say that I was interested in cortex-a8, so why
am I getting a cortex-a8 workaround that's detrimental on every other device,
and potentially catastrophic in register-tight code?

