<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Wrong generation of 256/512 bits vperm* from 128 mov"

   href="https://bugs.llvm.org/show_bug.cgi?id=40815">40815</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Wrong generation of 256/512 bits vperm* from 128 mov

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>7.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>LLVM Codegen

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>gael.guennebaud@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org, neeilans@live.com, richard-llvm@metafoo.co.uk

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Clang 6 and 7, with -O2 and either AVX or AVX512, wrongly optimize some sequences

of 128-bit loads/stores when the source memory has already been loaded into a 256-

or 512-bit register.

See the self-contained demo:

  <a href="https://godbolt.org/z/oFhMze">https://godbolt.org/z/oFhMze</a>

This issue has been discovered in Eigen

(<a href="http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1684">http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1684</a>). The above demo includes

both a self-contained example and some Eigen-based examples at the bottom.

The problem is much clearer in AVX512 than in AVX as it generates:

  vmovaps zmm0, zmmword ptr [rip + .LCPI2_0] # zmm0 = [3,4,5,6,2,3,4,5,1,2,3,4,0,1,2,3]

  vpermps zmm0, zmm0, zmmword ptr [rdi]

instead of:

  # zmm0 = [12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3]

  vpermps zmm0, zmm0, zmmword ptr [rdi]

(btw, I'm very impressed that it folded all this code into a single vpermps; too

bad it's wrong)
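
To make the difference between the two index tables above easier to see, here
is a minimal Python sketch that models a 16-lane permute as
dst[i] = src[idx[i] % 16] and applies both tables to the floats 0..15. The
helper names (permute16, chunk_starts) are hypothetical, and reading the
bracketed lists left to right as output lanes 0..15 is an assumption; the
report does not state the lane order.

```python
# Toy model of a 16-lane 32-bit permute: dst[i] = src[idx[i] % 16].
# Assumption: the bracketed lists in the report are read left to right as
# output lanes 0..15; the report itself does not state the lane order.

WRONG_IDX = [3, 4, 5, 6, 2, 3, 4, 5, 1, 2, 3, 4, 0, 1, 2, 3]
EXPECTED_IDX = [12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3]


def permute16(src, idx):
    """Return the permuted copy of src selected by idx."""
    return [src[j % 16] for j in idx]


def chunk_starts(idx):
    """First source index of each 128-bit (4-float) output chunk."""
    return [idx[i] for i in range(0, 16, 4)]


if __name__ == "__main__":
    src = list(range(16))
    print("wrong   :", permute16(src, WRONG_IDX))
    print("expected:", permute16(src, EXPECTED_IDX))
    print("wrong chunk starts   :", chunk_starts(WRONG_IDX))
    print("expected chunk starts:", chunk_starts(EXPECTED_IDX))
```

Under that reading, the expected 128-bit chunks start at source floats 12, 8,
4 and 0, while the generated table starts them at 3, 2, 1 and 0, so the
generated permute only ever reads floats 0..6 of the source.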

With the "trunk" version on godbolt, the issue does not show up as clang/llvm

does not try to generate vperm* but instead it generates a sequence of

vinsert*.

I still reported this issue because:

1- It is not clear whether this issue has been properly identified and is not

simply hidden in trunk waiting to pop up again.

2- It would be worth fixing the 7 branch.

3- Do you have any suggestion for us to work around this issue with clang6/7 on

Eigen's side? The only foolproof solution I have so far is to ban

clang6/7+AVX{512} with a #error... That would be extremely bad as this would

mean about x8 slowdowns of matrix products, linear solves and the like with

clang6/7 on AVX512.

4- Very minor: performance-wise, on AVX512 the vperm approach is usually

significantly faster than a sequence of vinsert, though vperm requires a full

cache line to hold the indices.</pre>

        </div>

      </p>


    </body>

</html>