<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Wrong generation of 256/512 bits vperm* from 128 mov"
href="https://bugs.llvm.org/show_bug.cgi?id=40815">40815</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Wrong generation of 256/512 bits vperm* from 128 mov
</td>
</tr>
<tr>
<th>Product</th>
<td>clang
</td>
</tr>
<tr>
<th>Version</th>
<td>7.0
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>All
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>LLVM Codegen
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedclangbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>gael.guennebaud@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org, neeilans@live.com, richard-llvm@metafoo.co.uk
</td>
</tr></table>
<p>
<div>
<pre>Clang 6 and 7, with -O2 and either AVX or AVX512, incorrectly optimize some
sequences of 128-bit loads/stores when the source memory has already been loaded
into a 256- or 512-bit register.
See the self-contained demo:
<a href="https://godbolt.org/z/oFhMze">https://godbolt.org/z/oFhMze</a>
This issue has been discovered in Eigen
(<a href="http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1684">http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1684</a>). The above demo includes
both a self-contained example and some Eigen-based examples at the bottom.
The problem is much clearer in AVX512 than in AVX as it generates:
vmovaps zmm0, zmmword ptr [rip + .LCPI2_0] # zmm0 = [3,4,5,6,2,3,4,5,1,2,3,4,0,1,2,3]
vpermps zmm0, zmm0, zmmword ptr [rdi]
instead of:
# zmm0 = [12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3]
vpermps zmm0, zmm0, zmmword ptr [rdi]
(btw, I'm very impressed that it folded all this code into a single vpermps; too
bad it's wrong)
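
To make the difference between the two index vectors concrete, here is a
minimal scalar sketch of what vpermps computes on 16 floats
(dst[i] = src[idx[i]]). It is not the code from the demo: the names permute16,
kEmitted, kExpected and same_result are made up for illustration, and it
assumes the index vectors in the assembly comments are listed from element 0
upward.

```cpp
// Scalar model of a 512-bit vpermps: each output element i is taken from
// the source element selected by idx[i] (only the low 4 bits matter, which
// for the non-negative indices used here is the same as idx[i] % 16).
void permute16(const int* idx, const float* src, float* dst) {
    for (int i = 0; i != 16; ++i) {
        dst[i] = src[idx[i] % 16];
    }
}

// Index vectors copied from the assembly comments in the report.
// Assumption: they are listed starting at element 0.
const int kEmitted[16]  = {3, 4, 5, 6, 2, 3, 4, 5, 1, 2, 3, 4, 0, 1, 2, 3};
const int kExpected[16] = {12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3};

// Returns true when both index vectors produce the same output for the
// test input src[i] = i.
bool same_result() {
    float src[16], emitted[16], expected[16];
    for (int i = 0; i != 16; ++i) {
        src[i] = float(i);
    }
    permute16(kEmitted, src, emitted);
    permute16(kExpected, src, expected);
    for (int i = 0; i != 16; ++i) {
        if (emitted[i] != expected[i]) {
            return false;
        }
    }
    return true;
}
```

With src[i] = i, the expected indices reverse the order of the four 128-bit
lanes, whereas the emitted indices only ever read source elements 0 to 6, so
same_result() returns false.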
With the "trunk" version on godbolt, the issue does not show up as clang/llvm
does not try to generate vperm* but instead it generates a sequence of
vinsert*.
I still reported this issue because:
1- It is not clear whether this issue has been properly identified and fixed, or
is simply hidden in trunk, waiting to pop up again.
2- It would be worth fixing the 7 branch.
3- Do you have any suggestions for how we could work around this issue with
clang6/7 on Eigen's side? The only foolproof solution I have so far is to ban
clang6/7+AVX{512} with a #error... That would be extremely bad, as it would
mean about 8x slowdowns of matrix products, linear solves, and the like with
clang6/7 on AVX512.
4- Very minor: performance-wise, on AVX512 the vperm approach is usually
significantly faster than a sequence of vinsert, though vperm requires a full
cache line to hold the indices.</pre>
</div>
</p>
</body>
</html>