[llvm-bugs] [Bug 26645] New: [LIR] Non-temporal aspect dropped via conversion to memset in some cases
via llvm-bugs
llvm-bugs at lists.llvm.org
Tue Feb 16 22:49:21 PST 2016
https://llvm.org/bugs/show_bug.cgi?id=26645
Bug ID: 26645
Summary: [LIR] Non-temporal aspect dropped via conversion to
memset in some cases
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Loop Optimizer
Assignee: unassignedbugs at nondot.org
Reporter: warren_ristow at playstation.sony.com
CC: llvm-bugs at lists.llvm.org
Classification: Unclassified
Created attachment 15914
--> https://llvm.org/bugs/attachment.cgi?id=15914&action=edit
test.ll
In the C++ test-case below (the associated "test.ll" file is attached), a loop
that clears a block of memory does so with non-temporal stores via the builtin:
__builtin_ia32_movntps(__p, __a);
(This comes from the _mm_stream_ps() intrinsic, originally via an include of
<x86intrin.h>.)
Prior to r258620, the stores were done via the non-temporal store instruction
'movntps'. With r258620, the loop is transformed to a memset() call, and so
the non-temporal aspect is lost, causing a performance regression relative to
llvm 3.8.
(Side note: r258620 was reverted at r258703, and an updated version was
re-submitted at r258777. The same behavior happens at r258777, as well as
current ToT (tested r261028).)
It's understood that this is purely a code-performance issue due to cache
pollution, not a correctness issue: the correct answer is computed
irrespective of whether the non-temporal store instructions are generated or
the memset() call is used. This is analogous to bug 19370, where the
non-temporal aspect was also lost in some situations, resulting in a
performance loss due to cache pollution.
Note that the loop trip count in the case below is a constant. The loop is of
the form:
for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {
..
}
If the loop bound is instead read from a global variable, for example:
unsigned int theSize = sizeof( bigBlock_t );
and the loop is changed to:
for ( unsigned int index = 0; index < theSize; index += 32 ) {
..
}
then the non-temporal store instructions are again produced.
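
For reference, the variant described above can be written out in full roughly
as follows. This is only a sketch: it assumes an x86 target with SSE, pulls
the intrinsics from <xmmintrin.h> rather than redefining them as the test-case
does, and the names nontemporal_init_runtime_bound() and is_all_zero() are
made up for illustration (they are not in the attached test).

```cpp
#include <xmmintrin.h>  // _mm_setzero_ps, _mm_stream_ps (x86 SSE assumed)
#include <cstring>      // std::memcpy, std::memset

struct bigBlock_t {
    __m128 data[256];
} __attribute__((aligned(128)));

// The loop bound lives in a non-const global, so it is loaded at run time
// instead of being a compile-time constant.
unsigned int theSize = sizeof(bigBlock_t);

// Hypothetical name: same loop body as nontemporal_init() in test.cpp,
// but bounded by theSize.
void nontemporal_init_runtime_bound(bigBlock_t *p) {
    float *dst = reinterpret_cast<float *>(p);
    __m128 src = _mm_setzero_ps();
    for (unsigned int index = 0; index < theSize; index += 32) {
        _mm_stream_ps(dst + 0, src);  // first 16 bytes of this 32-byte step
        _mm_stream_ps(dst + 4, src);  // second 16 bytes of this 32-byte step
        dst += 8;                     // advance 8 floats = 32 bytes
    }
}

// Hypothetical helper: checks that every byte of the block is zero.
bool is_all_zero(const bigBlock_t &b) {
    unsigned char bytes[sizeof(bigBlock_t)];
    std::memcpy(bytes, &b, sizeof b);
    for (unsigned char c : bytes) {
        if (c != 0) return false;
    }
    return true;
}
```

Either form of the loop clears the same 4096 bytes; only the instructions
chosen for the stores differ.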
_____________________________________________________________________
$ cat test.cpp
typedef float __m128 __attribute__((__vector_size__(16)));
static __inline__ __m128
__attribute__((__always_inline__, __nodebug__, __target__("sse")))
_mm_setzero_ps(void)
{
  return (__m128){ 0, 0, 0, 0 };
}
static __inline__ void
__attribute__((__always_inline__, __nodebug__, __target__("sse")))
_mm_stream_ps(float *__p, __m128 __a)
{
  __builtin_ia32_movntps(__p, __a);
}
struct bigBlock_t {
  __m128 data[256];
} __attribute__((aligned(128)));
extern void nontemporal_init( bigBlock_t *p );
void nontemporal_init( bigBlock_t *p ) {
  float *dst = reinterpret_cast< float * >( p );
  __m128 src = _mm_setzero_ps();
  for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {
    _mm_stream_ps( dst + 0, src );
    _mm_stream_ps( dst + 4, src );
    dst += 8;
  }
}
$
The attached "test.ll" was generated with the r258619 build, as shown below:
$ clang++ --version
clang version 3.9.0 (trunk 258619)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ..../llvm/bin
$ clang++ -S -emit-llvm -O0 test.cpp # test.ll created here is attached
$
Using opt/llc from r258619, the 'movntps' instructions can be seen:
$ opt test.ll -O2 -S -o opt.ll # opt from r258619
$ llc opt.ll -o opt.s
$ grep movntps opt.s # 16 'movntps' instructions, since loop is unrolled
movntps %xmm0, (%rdi,%rax)
movntps %xmm0, 16(%rdi,%rax)
movntps %xmm0, 32(%rdi,%rax)
movntps %xmm0, 48(%rdi,%rax)
movntps %xmm0, 64(%rdi,%rax)
movntps %xmm0, 80(%rdi,%rax)
movntps %xmm0, 96(%rdi,%rax)
movntps %xmm0, 112(%rdi,%rax)
movntps %xmm0, 128(%rdi,%rax)
movntps %xmm0, 144(%rdi,%rax)
movntps %xmm0, 160(%rdi,%rax)
movntps %xmm0, 176(%rdi,%rax)
movntps %xmm0, 192(%rdi,%rax)
movntps %xmm0, 208(%rdi,%rax)
movntps %xmm0, 224(%rdi,%rax)
movntps %xmm0, 240(%rdi,%rax)
$ grep memset opt.s
$
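
The count of 16 'movntps' instructions can be cross-checked against the source
loop with a little arithmetic. The sketch below does that at compile time; the
vec4f typedef is a stand-in for __m128, all constant names are made up for
illustration, and the unroll factor of 8 is inferred from the grep output
above rather than read out of the optimizer.

```cpp
#include <cstddef>

typedef float vec4f __attribute__((__vector_size__(16)));  // stand-in for __m128

struct bigBlock_t {
    vec4f data[256];
} __attribute__((aligned(128)));

constexpr std::size_t kBlockBytes = sizeof(bigBlock_t);  // 256 vectors * 16 bytes
constexpr std::size_t kStepBytes = 32;                   // "index += 32"
constexpr std::size_t kStoresPerIteration = 2;           // two _mm_stream_ps calls
constexpr std::size_t kStoreBytes = 16;                  // one movntps writes 16 bytes

// Source-level trip count and total number of stores.
constexpr std::size_t kTripCount = kBlockBytes / kStepBytes;
constexpr std::size_t kTotalStores = kTripCount * kStoresPerIteration;

// 16 movntps lines were reported by grep (offsets 0 through 240).
constexpr std::size_t kStoresInUnrolledBody = 16;
constexpr std::size_t kUnrollFactor = kStoresInUnrolledBody / kStoresPerIteration;
constexpr std::size_t kBytesPerUnrolledIteration = kStoresInUnrolledBody * kStoreBytes;
constexpr std::size_t kUnrolledTripCount = kTripCount / kUnrollFactor;

static_assert(kBlockBytes == 4096, "block is 4096 bytes");
static_assert(kTripCount == 128, "128 source-level iterations");
static_assert(kTotalStores == 256, "256 stores in total");
static_assert(kUnrollFactor == 8, "inferred unroll factor");
static_assert(kBytesPerUnrolledIteration == 256, "offsets 0..240 plus 16");
static_assert(kUnrolledTripCount == 16, "16 unrolled iterations");
```

So the 16 stores in the unrolled body cover 256 bytes per pass, and 16 passes
cover the whole 4096-byte block.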
Using opt/llc from r258620, the 'movntps' instructions are no longer there, and
instead there is a call to memset():
$ opt test.ll -O2 -S -o opt.ll # opt from r258620
$ llc opt.ll -o opt.s
$ grep movntps opt.s
$ grep memset opt.s
callq memset
$