<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Loop-idiom recognition for memset in the inner-loop of a nested-loop interferes with vectorization"

   href="https://bugs.llvm.org/show_bug.cgi?id=32854">32854</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Loop-idiom recognition for memset in the inner-loop of a nested-loop interferes with vectorization

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Loop Optimizer

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>brycelelbach@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Created <span class=""><a href="attachment.cgi?id=18382" name="attach_18382" title="Reduced Test Case">attachment 18382</a></span>

Reduced Test Case

Compilation options, build environment, etc. are documented in the attached file

and here:

<a href="https://wandbox.org/permlink/o06VeIxCKC1qIhUh">https://wandbox.org/permlink/o06VeIxCKC1qIhUh</a>

Summary: We have a nested loop like this (where A is a double* __restrict__):

    for (ptrdiff_t j = 0; j != N; ++j)

        for (ptrdiff_t i = 0; i != N; ++i)

            A[i + j * N] = 0.0F;

Loop-idiom recognition determines that it can replace the inner loop with

memset, turning the code into:

    for (ptrdiff_t j = 0; j != N; ++j)

        std::memset(A + j * N, 0, sizeof(double) * N); // e.g. @llvm.memset

Later, the vectorizer sees this code and decides to bail out because it cannot

vectorize the inserted call to @llvm.memset.

I have so many questions here :)

0.) The diagnostic that the vectorizer pass remarks give is not very helpful:

'call instruction cannot be vectorized', BUT the source location it points to

isn't a call - it's the user's original code. Many users may not divine the fact

that loop-idiom replacement occurred and end up fruitlessly trying to figure out

why an assignment to a double (the source location pointed to) is a call that cannot

be vectorized. At the very least, the pass remark (emitted from here:

<a href="https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/Vectorize/LoopVectorize.cpp#L5422">https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/Vectorize/LoopVectorize.cpp#L5422</a>)

could give the name of the function in the function call that could not be

vectorized (which I assume would be something like "memset" or "@llvm.memset"

in this case).

1.) Why is there not a vector version of @llvm.memset in addition to the scalar

version? Is this a problem with the underlying C library on my target (x86

Linux)?

2.) Why does the vectorizer give up when it encounters a scalar function call?

If the function is noexcept, it should be able to take something like this:

    // Assume A is a cache-line-aligned double* __restrict__

    // and N is divisible by some nice number, say 32.

    for (ptrdiff_t i = 0; i != N; ++i)

    {

        double tmp = scalar_noexcept_f(i);

        A[i] += B[i] * tmp;

    }

And turn it into something like this:

    // Assume A is a cache-line-aligned double* __restrict__

    // and N is divisible by some nice number, say 32.

    for (ptrdiff_t i = 0; i != N; i += 8)

    {

        // Vectorize "around" the scalar call. _mm512_setr_pd takes its

        // arguments in memory order (the first argument goes to lane 0).

        __m512d tmp = _mm512_setr_pd(

            scalar_noexcept_f(i)

          , scalar_noexcept_f(i+1)

          , scalar_noexcept_f(i+2)

          , scalar_noexcept_f(i+3)

          , scalar_noexcept_f(i+4)

          , scalar_noexcept_f(i+5)

          , scalar_noexcept_f(i+6)

          , scalar_noexcept_f(i+7)

        );

        _mm512_store_pd(

            A + i

          , _mm512_fmadd_pd(   // computes B * tmp + A

                _mm512_load_pd(B + i)

              , tmp

              , _mm512_load_pd(A + i)

            )

        );

    }
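
A portable sketch of the same "vectorize around the call" idea without
intrinsics, for illustration only. It assumes a vector width of 8 doubles and
N divisible by 8; scalar_noexcept_f here is a made-up stand-in for an opaque
scalar function, and long stands in for ptrdiff_t to avoid headers. The scalar
calls are peeled into a small buffer so the remaining loop is call-free:

```cpp
// Hypothetical stand-in for an opaque scalar function (not from the report).
double scalar_noexcept_f(long i) noexcept { return 0.5 * i; }

// Reference: the original scalar loop.
void update_reference(double* A, const double* B, long N) noexcept
{
    for (long i = 0; i != N; ++i)
        A[i] += B[i] * scalar_noexcept_f(i);
}

// Strip-mined form: assumes N is a multiple of W (assumed vector width).
void update_strip_mined(double* A, const double* B, long N) noexcept
{
    constexpr long W = 8;
    for (long i = 0; i != N; i += W)
    {
        double tmp[W];

        // Scalar part: one call per lane, results collected in order.
        for (long k = 0; k != W; ++k)
            tmp[k] = scalar_noexcept_f(i + k);

        // Call-free part: plain multiply-add over W contiguous elements.
        for (long k = 0; k != W; ++k)
            A[i + k] += B[i + k] * tmp[k];
    }
}
```

The second inner loop contains no call, so it is the kind of loop a vectorizer
could be expected to handle on its own.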

3.) Why isn't loop-idiom recognition "nested loop aware"? In this case, my

nested loops could be turned into a single memset.</pre>
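<pre>A sketch of why a single memset looks feasible for the nested case, assuming A
points to N * N contiguous doubles. The index i + j * N visits every element
from 0 to N * N - 1 exactly once, so the nest is equivalent to one flat loop,
which has the same shape as the inner loop that loop-idiom recognition already
handles. The function names are made up for illustration, and long stands in
for ptrdiff_t to avoid headers:

```cpp
// The nested form from the report.
void zero_nested(double* A, long N)
{
    for (long j = 0; j != N; ++j)
        for (long i = 0; i != N; ++i)
            A[i + j * N] = 0.0;
}

// Equivalent flat form: one loop over all N * N elements.
void zero_flat(double* A, long N)
{
    for (long k = 0; k != N * N; ++k)
        A[k] = 0.0;
}
```

Under those assumptions the flat loop corresponds to one memset of
sizeof(double) * N * N bytes rather than N separate calls.</pre>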

        </div>

      </p>


    </body>

</html>