<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - AVX2 - aggressive broadcast generation instead of memory operands"
href="https://bugs.llvm.org/show_bug.cgi?id=32564">32564</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>AVX2 - aggressive broadcast generation instead of memory operands
</td>
</tr>
<tr>
<th>Product</th>
<td>clang
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Windows NT
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>LLVM Codegen
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedclangbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>regis.portalez@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>Created <span class=""><a href="attachment.cgi?id=18247" name="attach_18247" title="open zip to see code reproducer (clang vs gcc and icc)">attachment 18247</a> <a href="attachment.cgi?id=18247&action=edit" title="open zip to see code reproducer (clang vs gcc and icc)">[details]</a></span>
open zip to see code reproducer (clang vs gcc and icc)
Following resolution of <a class="bz_bug_link
bz_status_RESOLVED bz_closed"
title="RESOLVED FIXED - constant vector value splatted at runtime with broadcast instruction instead of loaded (only core-avx2 target?)"
href="show_bug.cgi?id=20054">bug #20054</a>
(<a class="bz_bug_link
bz_status_RESOLVED bz_closed"
title="RESOLVED FIXED - constant vector value splatted at runtime with broadcast instruction instead of loaded (only core-avx2 target?)"
href="show_bug.cgi?id=20054">https://bugs.llvm.org/show_bug.cgi?id=20054</a>),
LLVM codegen (x86, AVX2) now always generates broadcast instructions for splat
values instead of using memory operands.
See this reproducer:
#include <immintrin.h>
__m256d mulconst(__m256d x) {
    const __m256d a = { 15.0, 15.0, 15.0, 15.0 };
    return _mm256_mul_pd(x, a);
}
With [ -O3 -g -S -mavx2 -mavx -mfma ], clang generates:
.LCPI0_0:
        .quad   4624633867356078080     # double 15
mulconst(double __vector(4)):           # @mulconst(double __vector(4))
        vbroadcastsd    ymm1, qword ptr [rip + .LCPI0_0]
        vmulpd  ymm0, ymm0, ymm1
        ret
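One way to compare compilers on this is to build a few equivalent source forms
with the same flags and diff the assembly. The sketch below is only an
illustration: the pointer-based signatures, the names mul_literal, mul_set1,
mul_load and k15, and the AVX_FN macro are assumptions added here, not part of
the reproducer above or of the attachment.

```c
#include <immintrin.h>

/* Assumption: the target attribute lets this file build without -mavx;
   with the flags used in the report it is redundant. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical 32-byte-aligned copy of the splat constant in memory. */
static const double k15[4] __attribute__((aligned(32))) =
    { 15.0, 15.0, 15.0, 15.0 };

/* Variant 1: vector literal, as in the reproducer. */
AVX_FN void mul_literal(const double *in, double *out) {
    const __m256d a = { 15.0, 15.0, 15.0, 15.0 };
    _mm256_storeu_pd(out, _mm256_mul_pd(_mm256_loadu_pd(in), a));
}

/* Variant 2: explicit splat through the set1 intrinsic. */
AVX_FN void mul_set1(const double *in, double *out) {
    const __m256d a = _mm256_set1_pd(15.0);
    _mm256_storeu_pd(out, _mm256_mul_pd(_mm256_loadu_pd(in), a));
}

/* Variant 3: explicit aligned load of the full 32-byte constant. */
AVX_FN void mul_load(const double *in, double *out) {
    const __m256d a = _mm256_load_pd(k15);
    _mm256_storeu_pd(out, _mm256_mul_pd(_mm256_loadu_pd(in), a));
}
```

All three variants compute the same product, so any difference in the emitted
code comes from how the compiler materializes the constant. For the reproducer
itself, the listing above shows clang choosing the vbroadcastsd form.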
This is legitimate when optimizing for code size, but not when optimizing for speed.
Indeed:
vbroadcastsd is an additional instruction,
its result occupies an extra register (which can in turn cause spilling),
and this prevents any use of memory operands, even with inline assembly.
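To make the register argument concrete, here is a minimal sketch of a kernel
that uses several distinct splat constants. It is not the attached reproducer:
the function name poly3, the coefficients and the AVX_FN macro are hypothetical.
If every constant is broadcast, each one occupies a ymm register for the whole
loop, whereas a memory operand would let the add read it directly. Whether
spills actually appear depends on the compiler version and on how many other
values are live, so a kernel this small may well not spill.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption: the target attribute lets this file build without -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical kernel: evaluates 2*x^3 - 3*x^2 + 0.5*x + 1 with Horner's
   rule on n doubles. n must be a multiple of 4. */
AVX_FN void poly3(const double *x, double *y, size_t n) {
    /* Four splat constants, all live across the loop. */
    const __m256d c3 = _mm256_set1_pd(2.0);
    const __m256d c2 = _mm256_set1_pd(-3.0);
    const __m256d c1 = _mm256_set1_pd(0.5);
    const __m256d c0 = _mm256_set1_pd(1.0);

    for (size_t i = 0; i < n; i += 4) {
        __m256d v = _mm256_loadu_pd(x + i);
        __m256d r = c3;
        r = _mm256_add_pd(_mm256_mul_pd(r, v), c2);
        r = _mm256_add_pd(_mm256_mul_pd(r, v), c1);
        r = _mm256_add_pd(_mm256_mul_pd(r, v), c0);
        _mm256_storeu_pd(y + i, r);
    }
}
```

Compiling this with the flags from the report and counting the vbroadcastsd
instructions and stack stores in the output gives a quick check of how each
compiler handles the constants.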
See the attached larger reproducer to spot the unnecessary spills (it compares the
assembly generated by gcc 6.2 and clang 4.0).</pre>
</div>
</p>
</body>
</html>