<html>

    <head>

      <base href="https://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - [LIR] Non-temporal aspect dropped via conversion to memset in some cases"

   href="https://llvm.org/bugs/show_bug.cgi?id=26645">26645</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[LIR] Non-temporal aspect dropped via conversion to memset in some cases

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Loop Optimizer

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>warren_ristow@playstation.sony.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Created <span class=""><a href="attachment.cgi?id=15914" name="attach_15914" title="test.ll">attachment 15914</a> <a href="attachment.cgi?id=15914&action=edit" title="test.ll">[details]</a></span>

test.ll

In the C++ test-case below (the associated "test.ll" file is attached), a loop

that clears a block of memory does so with non-temporal stores via the builtin:

    __builtin_ia32_movntps(__p, __a);

(This is from the _mm_stream_ps() intrinsic, originally via an include of

&lt;x86intrin.h&gt;.)

Prior to r258620, the stores were done via the non-temporal store instruction

'movntps'.  With r258620, the loop is transformed to a memset() call, and so

the non-temporal aspect is lost, causing a performance regression relative to

LLVM 3.8.

(Side note: r258620 was reverted at r258703, and an updated version was

re-submitted at r258777.  The same behavior occurs at r258777 and at

current ToT (tested at r261028).)

It's understood that this is a code-performance issue due to cache pollution.

That is, the correct answer is computed irrespective of whether the

non-temporal store instructions are generated, or whether the memset() call is

used.  This is analogous to the situation of <a class="bz_bug_link 

          bz_status_RESOLVED  bz_closed"

   title="RESOLVED FIXED - [X86] Non-temporal store from _mm_stream_ps is not mapped to movntps in some cases"

   href="show_bug.cgi?id=19370">bug 19370</a>, where the non-temporal

aspect was lost in some situations, resulting in a performance loss due to

cache pollution.

Note that the loop trip count in the case below is a constant.  The loop is of

the form:

    for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {

       ..

    }

If the loop bound is instead held in a global variable, for example:

  unsigned int theSize = sizeof( bigBlock_t );

and the loop is changed to:

    for ( unsigned int index = 0; index < theSize; index += 32 ) {

       ..

    }

then the non-temporal store instructions are again produced.
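
For reference, that variant can be written out as a minimal, self-contained
sketch.  It assumes an x86 target with SSE and a GCC- or Clang-compatible
compiler.  The function name nontemporal_init_runtime_bound and the trailing
_mm_sfence() call are illustrative additions and are not part of test.cpp.

```cpp
#include "xmmintrin.h"  // _mm_setzero_ps, _mm_stream_ps, _mm_sfence

struct bigBlock_t {
  __m128 data[256];
} __attribute__((aligned(128)));

// Loop bound held in a non-const global, so it is not a compile-time
// constant inside the loop below.
unsigned int theSize = sizeof( bigBlock_t );

// Hypothetical name: same body as nontemporal_init() in test.cpp, except
// that the loop bound is read from theSize.
void nontemporal_init_runtime_bound( bigBlock_t *p ) {
  float *dst = reinterpret_cast< float * >( p );
  __m128 src = _mm_setzero_ps();
  for ( unsigned int index = 0; index < theSize; index += 32 ) {
    _mm_stream_ps( dst + 0, src );  // non-temporal store of bytes 0..15
    _mm_stream_ps( dst + 4, src );  // non-temporal store of bytes 16..31
    dst += 8;                       // advance 8 floats (32 bytes)
  }
  _mm_sfence();  // illustrative addition: order the streaming stores
}
```

Whether a given compiler revision keeps the 'movntps' stores for this form
should be checked with the same opt/llc/grep steps shown further down; the
sketch only pins down the source shape being described.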

_____________________________________________________________________

  $ cat test.cpp

  typedef float __m128 __attribute__((__vector_size__(16)));

  static __inline__ __m128

  __attribute__((__always_inline__, __nodebug__, __target__("sse")))

  _mm_setzero_ps(void)

  {

    return (__m128){ 0, 0, 0, 0 };

  }

  static __inline__ void

  __attribute__((__always_inline__, __nodebug__, __target__("sse")))

  _mm_stream_ps(float *__p, __m128 __a)

  {

    __builtin_ia32_movntps(__p, __a);

  }

  struct bigBlock_t {

   __m128 data[256];

  } __attribute__((aligned(128)));

  extern void nontemporal_init( bigBlock_t *p );

  void nontemporal_init( bigBlock_t *p ) {

    float *dst = reinterpret_cast< float * >( p );

    __m128 src = _mm_setzero_ps();

    for ( unsigned int index = 0; index < sizeof( bigBlock_t ); index += 32 ) {

      _mm_stream_ps( dst + 0, src );

      _mm_stream_ps( dst + 4, src );

      dst += 8;

    }

  }

  $

The attached "test.ll" was generated with the r258619 build, as shown

below:

  $ clang++ --version

  clang version 3.9.0 (trunk 258619)

  Target: x86_64-unknown-linux-gnu

  Thread model: posix

  InstalledDir: ..../llvm/bin

  $ clang++ -S -emit-llvm -O0 test.cpp  # test.ll created here is attached

  $

Using opt/llc from r258619, the 'movntps' instructions can be seen:

  $ opt test.ll -O2 -S -o opt.ll        # opt from r258619

  $ llc opt.ll -o opt.s

  $ grep movntps opt.s      # 16 'movntps' instructions, since loop is unrolled

          movntps %xmm0, (%rdi,%rax)

          movntps %xmm0, 16(%rdi,%rax)

          movntps %xmm0, 32(%rdi,%rax)

          movntps %xmm0, 48(%rdi,%rax)

          movntps %xmm0, 64(%rdi,%rax)

          movntps %xmm0, 80(%rdi,%rax)

          movntps %xmm0, 96(%rdi,%rax)

          movntps %xmm0, 112(%rdi,%rax)

          movntps %xmm0, 128(%rdi,%rax)

          movntps %xmm0, 144(%rdi,%rax)

          movntps %xmm0, 160(%rdi,%rax)

          movntps %xmm0, 176(%rdi,%rax)

          movntps %xmm0, 192(%rdi,%rax)

          movntps %xmm0, 208(%rdi,%rax)

          movntps %xmm0, 224(%rdi,%rax)

          movntps %xmm0, 240(%rdi,%rax)

  $ grep memset opt.s

  $

Using opt/llc from r258620, the 'movntps' instructions are no longer there, and

instead there is a call to memset():

  $ opt test.ll -O2 -S -o opt.ll        # opt from r258620

  $ llc opt.ll -o opt.s

  $ grep movntps opt.s

  $ grep memset opt.s

          callq   memset

  $</pre>

        </div>

      </p>


    </body>

</html>