[llvm-bugs] [Bug 26645] New: [LIR] Non-temporal aspect dropped via conversion to memset in some cases

Tue Feb 16 22:49:21 PST 2016

https://llvm.org/bugs/show_bug.cgi?id=26645

            Bug ID: 26645
           Summary: [LIR] Non-temporal aspect dropped via conversion to
                    memset in some cases
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedbugs at nondot.org
          Reporter: warren_ristow at playstation.sony.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

Created attachment 15914
  --> https://llvm.org/bugs/attachment.cgi?id=15914&action=edit
test.ll

In the C++ test-case below (the associated "test.ll" file is attached), a loop
that clears a block of memory does so with non-temporal stores via the builtin:

    __builtin_ia32_movntps(__p, __a);

(This is from the _mm_stream_ps() intrinsic, originally via an include of
<x86intrin.h>.)

Prior to r258620, the stores were done via the non-temporal store instruction
'movntps'.  With r258620, the loop is transformed to a memset() call, and so
the non-temporal aspect is lost, causing a performance regression relative to
llvm 3.8.

(Side note: r258620 was reverted at r258703, and an updated version was
re-submitted at r258777.  The same behavior happens at r258777, as well as
current ToT (tested r261028).)

It's understood that this is a code-performance issue due to cache pollution.
That is, the correct answer is computed irrespective of whether the
non-temporal store instructions are generated, or whether the memset() call is
used.  This is analogous to the situation of bug 19370, where the non-temporal
aspect was lost in some situations, resulting in a performance loss due to
cache pollution.

Note that the loop trip count in the case below is a constant.  The loop is of
the form:
    for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {
       ..
    }

If the loop bound is instead read from a global variable, so that the
trip count is no longer a compile-time constant, for example:

  unsigned int theSize = sizeof( bigBlock_t );

and the loop is changed to:

    for ( unsigned int index = 0; index < theSize; index += 32 ) {
       ..
    }

then the non-temporal store instructions are again produced.
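
For reference, the variable-bound variant described above can be written out in
full as in the sketch below.  It is an illustration rather than a second
attached test-case: it assumes an x86 target with SSE, it takes the intrinsics
from <xmmintrin.h> instead of the hand-written wrappers in test.cpp, and the
names nontemporal_init_runtime_bound() and is_all_zero() are hypothetical.
Whether a particular compiler revision keeps the 'movntps' stores for it has to
be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_setzero_ps, _mm_stream_ps (x86 SSE only)

// Same layout as in the test-case: 256 * 16 = 4096 bytes.
struct bigBlock_t {
  __m128 data[256];
} __attribute__((aligned(128)));

// The loop bound lives in a mutable global, so the trip count is not a
// compile-time constant inside the function.
unsigned int theSize = sizeof(bigBlock_t);

// Hypothetical name; the body mirrors nontemporal_init() from the test-case,
// with only the loop bound changed.
void nontemporal_init_runtime_bound(bigBlock_t *p) {
  // movntps requires a 16-byte aligned destination.
  assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);

  float *dst = reinterpret_cast<float *>(p);
  const __m128 src = _mm_setzero_ps();

  for (unsigned int index = 0; index < theSize; index += 32) {
    _mm_stream_ps(dst + 0, src);  // bytes  0..15 of this 32-byte chunk
    _mm_stream_ps(dst + 4, src);  // bytes 16..31 of this 32-byte chunk
    dst += 8;                     // advance by 8 floats = 32 bytes
  }
}

// Hypothetical helper: checks the functional result only.  It cannot tell
// whether the stores were non-temporal; that needs a look at the assembly.
bool is_all_zero(const bigBlock_t *p) {
  const unsigned char *bytes = reinterpret_cast<const unsigned char *>(p);
  for (std::size_t i = 0; i < sizeof(bigBlock_t); ++i) {
    if (bytes[i] != 0) return false;
  }
  return true;
}
```

Both variants compute the same result, so a check like is_all_zero() passes
either way; the difference is only visible by grepping the assembly for
'movntps' and 'memset'.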
_____________________________________________________________________

  $ cat test.cpp
  typedef float __m128 __attribute__((__vector_size__(16)));

  static __inline__ __m128
  __attribute__((__always_inline__, __nodebug__, __target__("sse")))
  _mm_setzero_ps(void)
  {
    return (__m128){ 0, 0, 0, 0 };
  }

  static __inline__ void
  __attribute__((__always_inline__, __nodebug__, __target__("sse")))
  _mm_stream_ps(float *__p, __m128 __a)
  {
    __builtin_ia32_movntps(__p, __a);
  }

  struct bigBlock_t {
   __m128 data[256];
  } __attribute__((aligned(128)));

  extern void nontemporal_init( bigBlock_t *p );

  void nontemporal_init( bigBlock_t *p ) {
    float *dst = reinterpret_cast< float * >( p );
    __m128 src = _mm_setzero_ps();

    for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {
      _mm_stream_ps( dst + 0, src );
      _mm_stream_ps( dst + 4, src );
      dst += 8;
    }
  }
  $

The attached "test.ll" was generated with the r258619 build, as shown below:

  $ clang++ --version
  clang version 3.9.0 (trunk 258619)
  Target: x86_64-unknown-linux-gnu
  Thread model: posix
  InstalledDir: ..../llvm/bin
  $ clang++ -S -emit-llvm -O0 test.cpp  # test.ll created here is attached
  $

Using opt/llc from r258619, the 'movntps' instructions can be seen:

  $ opt test.ll -O2 -S -o opt.ll        # opt from r258619
  $ llc opt.ll -o opt.s
  $ grep movntps opt.s      # 16 'movntps' instructions, since loop is unrolled
          movntps %xmm0, (%rdi,%rax)
          movntps %xmm0, 16(%rdi,%rax)
          movntps %xmm0, 32(%rdi,%rax)
          movntps %xmm0, 48(%rdi,%rax)
          movntps %xmm0, 64(%rdi,%rax)
          movntps %xmm0, 80(%rdi,%rax)
          movntps %xmm0, 96(%rdi,%rax)
          movntps %xmm0, 112(%rdi,%rax)
          movntps %xmm0, 128(%rdi,%rax)
          movntps %xmm0, 144(%rdi,%rax)
          movntps %xmm0, 160(%rdi,%rax)
          movntps %xmm0, 176(%rdi,%rax)
          movntps %xmm0, 192(%rdi,%rax)
          movntps %xmm0, 208(%rdi,%rax)
          movntps %xmm0, 224(%rdi,%rax)
          movntps %xmm0, 240(%rdi,%rax)
  $ grep memset opt.s
  $
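
The count of 16 can be cross-checked against the sizes in the test-case.  The
sketch below does that arithmetic with compile-time assertions.  It is only an
illustration: v4f32 is a hypothetical stand-in name for the report's __m128
typedef, the k... constants are made up for this example, and the unroll factor
is inferred from the 16 stores and their offsets (0 through 240) in the grep
output, not read from the optimizer.

```cpp
#include <cstddef>

// Hypothetical stand-in for the __m128 typedef in test.cpp (16-byte vector).
typedef float v4f32 __attribute__((__vector_size__(16)));

struct bigBlock_t {
  v4f32 data[256];
} __attribute__((aligned(128)));

constexpr std::size_t kBlockBytes = sizeof(bigBlock_t);  // loop bound
constexpr std::size_t kStoreBytes = sizeof(v4f32);       // one movntps
constexpr std::size_t kStoresPerIteration = 2;           // two _mm_stream_ps
constexpr std::size_t kIndexStep = 32;                   // index += 32
constexpr std::size_t kTripCount = kBlockBytes / kIndexStep;

// Observed in the r258619 assembly: 16 movntps in the loop body.
constexpr std::size_t kStoresInUnrolledBody = 16;
constexpr std::size_t kUnrollFactor =
    kStoresInUnrolledBody / kStoresPerIteration;
constexpr std::size_t kBytesPerUnrolledBody =
    kStoresInUnrolledBody * kStoreBytes;

static_assert(kBlockBytes == 4096, "256 vectors of 16 bytes each");
static_assert(kStoresPerIteration * kStoreBytes == kIndexStep,
              "each source iteration writes exactly the bytes it steps over");
static_assert(kTripCount == 128, "constant trip count of the source loop");
static_assert(kUnrollFactor == 8, "16 stores / 2 stores per iteration");
static_assert(kBytesPerUnrolledBody == 256,
              "matches the offsets 0, 16, ..., 240 in the assembly");
static_assert(kTripCount % kUnrollFactor == 0,
              "128 source iterations divide evenly into unrolled bodies");
```

In other words, the source loop writes 4096 contiguous zero bytes in 128
iterations, and the 16 stores seen above are consistent with that loop being
unrolled by a factor of 8.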

Using opt/llc from r258620, the 'movntps' instructions are no longer there, and
instead there is a call to memset():

  $ opt test.ll -O2 -S -o opt.ll        # opt from r258620
  $ llc opt.ll -o opt.s
  $ grep movntps opt.s
  $ grep memset opt.s
          callq   memset
  $
