<html>
<head>
<base href="https://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - [LIR] Non-temporal aspect dropped via conversion to memset in some cases"
href="https://llvm.org/bugs/show_bug.cgi?id=26645">26645</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[LIR] Non-temporal aspect dropped via conversion to memset in some cases
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Loop Optimizer
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>warren_ristow@playstation.sony.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>Created <span class=""><a href="attachment.cgi?id=15914" name="attach_15914" title="test.ll">attachment 15914</a> <a href="attachment.cgi?id=15914&action=edit" title="test.ll">[details]</a></span>
test.ll
In the C++ test-case below (the associated "test.ll" file is attached), a loop
that clears a block of memory does so with non-temporal stores via the builtin:
__builtin_ia32_movntps(__p, __a);
(This is from the _mm_stream_ps() intrinsic, originally via an include of
<x86intrin.h>.)
Prior to r258620, the stores were done via the non-temporal store instruction
'movntps'. With r258620, the loop is transformed to a memset() call, and so
the non-temporal aspect is lost, causing a performance regression relative to
LLVM 3.8.
(Side note: r258620 was reverted at r258703, and an updated version was
re-submitted at r258777. The same behavior happens at r258777, as well as
current ToT (tested r261028).)
This is understood to be purely a performance issue caused by cache pollution.
That is, the correct answer is computed irrespective of whether the
non-temporal store instructions are generated, or whether the memset() call is
used. This is analogous to the situation of <a class="bz_bug_link
bz_status_RESOLVED bz_closed"
title="RESOLVED FIXED - [X86] Non-temporal store from _mm_stream_ps is not mapped to movntps in some cases"
href="show_bug.cgi?id=19370">bug 19370</a>, where the non-temporal
aspect was lost in some situations, resulting in a performance loss due to
cache pollution.
Note that the loop trip count in the case below is a constant. The loop is of
the form:
for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {
..
}
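
As a rough check of the arithmetic behind that constant bound, the sketch below
works out the trip count and the total number of bytes the loop writes. It is
only an illustration: Vec4 and BigBlock are hypothetical stand-ins for the
__m128 and bigBlock_t types in the test-case, and the constant and function
names are invented here. It shows that the loop stores zeros to exactly
sizeof( bigBlock_t ) contiguous bytes, so its result matches a single memset()
over the block.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for __m128: four floats, 16 bytes, 16-byte aligned.
struct Vec4 { alignas(16) float lane[4]; };

// Hypothetical stand-in for bigBlock_t: 256 vectors, 128-byte aligned.
struct alignas(128) BigBlock { Vec4 data[256]; };

constexpr std::size_t kBlockBytes   = sizeof(BigBlock);   // 256 * 16
constexpr std::size_t kStepBytes    = 32;                 // index += 32
constexpr std::size_t kBytesPerTrip = 2 * sizeof(Vec4);   // two 16-byte stores
constexpr std::size_t kTripCount =
    (kBlockBytes + kStepBytes - 1) / kStepBytes;          // iterations of the loop

static_assert(kBlockBytes == 4096, "block is 4096 bytes");
static_assert(kTripCount == 128, "loop runs 128 times");
static_assert(kTripCount * kBytesPerTrip == kBlockBytes,
              "the loop covers the whole block exactly once");

// Walks the same induction variable as the loop in the report and counts
// the bytes that the two stores per iteration would write.
std::size_t bytes_written_by_loop() {
    std::size_t written = 0;
    for (unsigned int index = 0; index < sizeof(BigBlock); index += 32) {
        written += kBytesPerTrip;
    }
    return written;
}
```

Calling bytes_written_by_loop() returns the same value as sizeof(BigBlock).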
If the loop bound is instead read from a global variable, for example:
unsigned int theSize = sizeof( bigBlock_t );
and the loop is changed to:
for ( unsigned int index = 0; index < theSize; index += 32 ) {
..
}
then the non-temporal store instructions are again produced.
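
For reference, the variant just described can be written out in full roughly as
follows. This is a sketch under some assumptions: it includes <xmmintrin.h>
rather than using the hand-inlined intrinsic definitions from the test-case, it
targets x86 with SSE, and the names nontemporal_init_runtime_bound and
block_is_zero are invented for illustration. Whether 'movntps' is actually
emitted depends on the compiler revision, so the code only demonstrates the
source-level change and that the block still ends up zeroed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <xmmintrin.h>  // __m128, _mm_setzero_ps, _mm_stream_ps (x86 SSE)

struct alignas(128) bigBlock_t {
    __m128 data[256];
};

// The loop bound lives in a non-const global with external linkage, so the
// optimizer cannot assume it still equals sizeof( bigBlock_t ).
unsigned int theSize = sizeof(bigBlock_t);

// Same loop body as nontemporal_init() in the report; only the bound differs.
void nontemporal_init_runtime_bound(bigBlock_t *p) {
    float *dst = reinterpret_cast<float *>(p);
    __m128 src = _mm_setzero_ps();
    for (unsigned int index = 0; index < theSize; index += 32) {
        _mm_stream_ps(dst + 0, src);  // non-temporal store of 16 bytes
        _mm_stream_ps(dst + 4, src);  // next 16 bytes
        dst += 8;
    }
}

// Hypothetical helper: returns true if every byte of the block is zero.
bool block_is_zero(const bigBlock_t *p) {
    const unsigned char *bytes = reinterpret_cast<const unsigned char *>(p);
    for (std::size_t i = 0; i < sizeof(bigBlock_t); ++i) {
        if (bytes[i] != 0) {
            return false;
        }
    }
    return true;
}
```

To compare code generation, this file can be run through the same
opt/llc/grep steps shown for test.cpp further down.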
_____________________________________________________________________
$ cat test.cpp
typedef float __m128 __attribute__((__vector_size__(16)));
static __inline__ __m128
__attribute__((__always_inline__, __nodebug__, __target__("sse")))
_mm_setzero_ps(void)
{
return (__m128){ 0, 0, 0, 0 };
}
static __inline__ void
__attribute__((__always_inline__, __nodebug__, __target__("sse")))
_mm_stream_ps(float *__p, __m128 __a)
{
__builtin_ia32_movntps(__p, __a);
}
struct bigBlock_t {
__m128 data[256];
} __attribute__((aligned(128)));
extern void nontemporal_init( bigBlock_t *p );
void nontemporal_init( bigBlock_t *p ) {
float *dst = reinterpret_cast< float * >( p );
__m128 src = _mm_setzero_ps();
for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {
_mm_stream_ps( dst + 0, src );
_mm_stream_ps( dst + 4, src );
dst += 8;
}
}
$
The "test.ll" file, generated with the r258619 build as shown below, is
attached:
$ clang++ --version
clang version 3.9.0 (trunk 258619)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ..../llvm/bin
$ clang++ -S -emit-llvm -O0 test.cpp # test.ll created here is attached
$
Using opt/llc from r258619, the 'movntps' instructions can be seen:
$ opt test.ll -O2 -S -o opt.ll # opt from r258619
$ llc opt.ll -o opt.s
$ grep movntps opt.s # 16 'movntps' instructions, since loop is unrolled
movntps %xmm0, (%rdi,%rax)
movntps %xmm0, 16(%rdi,%rax)
movntps %xmm0, 32(%rdi,%rax)
movntps %xmm0, 48(%rdi,%rax)
movntps %xmm0, 64(%rdi,%rax)
movntps %xmm0, 80(%rdi,%rax)
movntps %xmm0, 96(%rdi,%rax)
movntps %xmm0, 112(%rdi,%rax)
movntps %xmm0, 128(%rdi,%rax)
movntps %xmm0, 144(%rdi,%rax)
movntps %xmm0, 160(%rdi,%rax)
movntps %xmm0, 176(%rdi,%rax)
movntps %xmm0, 192(%rdi,%rax)
movntps %xmm0, 208(%rdi,%rax)
movntps %xmm0, 224(%rdi,%rax)
movntps %xmm0, 240(%rdi,%rax)
$ grep memset opt.s
$
Using opt/llc from r258620, the 'movntps' instructions are no longer there, and
instead there is a call to memset():
$ opt test.ll -O2 -S -o opt.ll # opt from r258620
$ llc opt.ll -o opt.s
$ grep movntps opt.s
$ grep memset opt.s
callq memset
$</pre>
</div>
</p>
</body>
</html>