<html>

    <head>

      <base href="https://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug!"

   href="https://llvm.org/bugs/show_bug.cgi?id=27219">27219</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug!

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>3.8

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: ARM

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>jacob.benoit.1@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Created <span class=""><a href="attachment.cgi?id=16172" name="attach_16172" title="testcase">attachment 16172</a> <a href="attachment.cgi?id=16172&action=edit" title="testcase">[details]</a></span>

testcase

For some values of -mcpu, at least -mcpu=cortex-a8 and -mcpu=cortex-a7, LLVM

replaces VMLA.F32 instructions by a (VMUL, VADD) pair.

That much seems to be well-known:

<a href="https://groups.google.com/d/msg/llvm-dev/N9u8Kv1m5do/GCyge4kZSnwJ">https://groups.google.com/d/msg/llvm-dev/N9u8Kv1m5do/GCyge4kZSnwJ</a>

Apparently, the idea is that on some old Cortex A8 CPUs, there was a

performance problem with VMLA, so replacing it with (VMUL, VADD) was a

work-around for that.

However, that is missing two facts:

Fact #1:

A (VMUL, VADD) pair needs a register to hold the temporary result of the VMUL.

In fully register-tight code making use of all NEON registers, that means

spilling.

Concretely, matrix multiplication (GEMM) kernels are an example of critical

code using all available NEON registers and doing mostly VMLA. That's how I

stumbled upon this bug: Eigen (<a href="http://eigen.tuxfamily.org">http://eigen.tuxfamily.org</a>) was generating

unexplainably bad code, with massive register spillage, running 10x slower than

normal.

Eigen needs to know the number of available registers, and whether a

single-instruction multiply-accumulate (thus not requiring an intermediate

temporary register) is available.

<a href="https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-23">https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-23</a>

The LLVM behavior of silently replacing VMLA by VMUL+VADD breaks what was

supposed to be an architecture invariant, and breaks Eigen's assumptions.

For now, Eigen works around this by reimplementing the vmlaq_f32 intrinsic in

inline assembly:

<a href="https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-192">https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-192</a>

One problem in the microbenchmarks that people have been discussing in the

above llvm mailing list thread, is that they measured isolated VMLA

instructions, not accounting for the side effects on register pressure, which

quickly become dominant in real-world register-tight numerical code.

Fact #2: Most software compiled with this VMLA rewriting, isn't actually

intended to run on a cortex-a8 device specifically.

I'm getting the VMLA rewriting even without passing any -mcpu flag, probably

because -mcpu=cortex-a8 (or some such) is the default:

~/android/toolchains/arm-linux-androideabi-clang3.5/bin/arm-linux-androideabi-clang++

~/vrac/vmlaq_f32_testcase.cc -S -o v.s -march=armv7-a -mfloat-abi=softfp

-mfpu=neon -O3

In this command line, I didn't say that I was interested in cortex-a8, so why

would I be getting a cortex-a8 workaround that's detrimental on every other

device, and potentially catastrophic on register-tight code?</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>