<html>
    <head>
      <base href="https://llvm.org/bugs/" />
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW --- - Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug!"
   href="https://llvm.org/bugs/show_bug.cgi?id=27219">27219</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug!
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>3.8
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: ARM
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>jacob.benoit.1@gmail.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr>

        <tr>
          <th>Classification</th>
          <td>Unclassified
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=16172" name="attach_16172" title="testcase">attachment 16172</a> <a href="attachment.cgi?id=16172&action=edit" title="testcase">[details]</a></span>
testcase

For some values of -mcpu, at least -mcpu=cortex-a8 and -mcpu=cortex-a7, LLVM
replaces VMLA.F32 instructions by a (VMUL, VADD) pair.

That much seems to be well-known:
<a href="https://groups.google.com/d/msg/llvm-dev/N9u8Kv1m5do/GCyge4kZSnwJ">https://groups.google.com/d/msg/llvm-dev/N9u8Kv1m5do/GCyge4kZSnwJ</a>

Apparently, the idea is that on some old Cortex A8 CPUs, there was a
performance problem with VMLA, so replacing it with (VMUL, VADD) was a
work-around for that.

However, that is missing two facts:



Fact #1:

A (VMUL, VADD) pair needs a register to hold the temporary result of the VMUL.
In fully register-tight code making use of all NEON registers, that means
spilling.

Concretely, matrix multiplication (GEMM) kernels are an example of critical
code using all available NEON registers and doing mostly VMLA. That's how I
stumbled upon this bug: Eigen (<a href="http://eigen.tuxfamily.org">http://eigen.tuxfamily.org</a>) was generating
unexplainably bad code, with massive register spillage, running 10x slower than
normal.

Eigen needs to know the number of available registers, and whether a
single-instruction multiply-accumulate (thus not requiring an intermediate
temporary register) is available.
<a href="https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-23">https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-23</a>

The LLVM behavior of silently replacing VMLA by VMUL+VADD breaks what was
supposed to be an architecture invariant, and breaks Eigen's assumptions.

For now, Eigen works around this by reimplementing the vmlaq_f32 intrinsic in
inline assembly:
<a href="https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-192">https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-192</a>

One problem in the microbenchmarks that people have been discussing in the
above llvm mailing list thread, is that they measured isolated VMLA
instructions, not accounting for the side effects on register pressure, which
quickly become dominant in real-world register-tight numerical code.



Fact #2: Most software compiled with this VMLA rewriting, isn't actually
intended to run on a cortex-a8 device specifically.

I'm getting the VMLA rewriting even without passing any -mcpu flag, probably
because -mcpu=cortex-a8 (or some such) is the default:

~/android/toolchains/arm-linux-androideabi-clang3.5/bin/arm-linux-androideabi-clang++
~/vrac/vmlaq_f32_testcase.cc -S -o v.s -march=armv7-a -mfloat-abi=softfp
-mfpu=neon -O3

In this command line, I didn't say that I was interested in cortex-a8, so why
would I be getting a cortex-a8 workaround that's detrimental on every other
device, and potentially catastrophic on register-tight code?</pre>
        </div>
      </p>
      <hr>
      <span>You are receiving this mail because:</span>
      
      <ul>
          <li>You are on the CC list for the bug.</li>
      </ul>
    </body>
</html>