<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Generated scatter instructions are slower than scalar version"

   href="https://bugs.llvm.org/show_bug.cgi?id=48429">48429</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Generated scatter instructions are slower than scalar version

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>carrot@google.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, pengfei.wang@intel.com, spatel+llvm@rotateright.com

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Compile the following code with this command line:

clang '--target=x86_64-grtev4-linux-gnu' -maes -m64 -mcx16 -msse4.2 -mpclmul \
  '-mprefer-vector-width=128' -fexperimental-new-pass-manager \
  -fsized-deallocation -O3 '-std=gnu++17' -c scatter.cc -save-temps

__attribute__((target("avx,avx2,fma,avx512f,avx512dq,avx512bw")))
void foo(int d, const float* ptr, float* dest)
{
    const float* ptr_end = ptr + d;
    // dest advances by 16 floats per iteration, so each of the four stores
    // has a stride of 16 floats across iterations.
    for (; ptr != ptr_end; ++ptr, dest += 16) {
        dest[0] = ptr[-1 * d];
        dest[1] = ptr[0 * d];
        dest[2] = ptr[1 * d];
        dest[3] = ptr[2 * d];
    }
}

LLVM generates 4-element scatters, which are more than 50% slower than the
scalar version on my Skylake desktop.

The problem is in X86TTIImpl::getGatherScatterOpCost(). It has already
determined that a scatter is not profitable when avx512vl is not enabled, so
the operation should be scalarized, and it returns a scalarized cost. But the
caller, LoopVectorize, does not know that this is a scalarized cost. It treats
it as a scatter cost and compares it with a different scalar cost computed by
getMemInstScalarizationCost. Unfortunately, the scalar cost computed by the X86
backend is smaller than the one computed by LoopVectorize, so LoopVectorize
concludes that a scatter is cheaper than scalarizing and generates the slow
scatter version.</pre>
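<pre>The mismatch described above can be sketched with a toy model. This is only an
illustration of the decision logic, not LLVM code: every name below
(Subtarget, targetScatterCost, chooseWidening) and every cost value is
hypothetical, and the real cost-model interfaces differ.

```cpp
// Toy model only: the names and numbers here are hypothetical and do not
// correspond to LLVM's actual API or to real cost data.

enum class Widening { Scatter, Scalarize };

struct Subtarget {
  bool hasAVX512VL;
};

// Stand-in for the target hook. When the scatter is considered unprofitable,
// it returns the backend's own scalarized estimate, but as a bare integer, so
// the caller cannot tell that the number no longer describes a scatter.
int targetScatterCost(Subtarget st, int backendScalarizedCost,
                      int nativeScatterCost) {
  if (!st.hasAVX512VL)
    return backendScalarizedCost;
  return nativeScatterCost;
}

// Stand-in for the vectorizer's decision. It treats the hook result as a
// scatter cost and compares it with its own, separately computed,
// scalarization estimate.
Widening chooseWidening(int costFromTargetHook, int vectorizerScalarizedCost) {
  return vectorizerScalarizedCost > costFromTargetHook ? Widening::Scatter
                                                       : Widening::Scalarize;
}
```

With made-up costs of 10 (backend scalarized estimate) and 14 (vectorizer
scalarized estimate), chooseWidening(10, 14) returns Widening::Scatter even
though the hook had already decided that the scatter should be scalarized.
</pre>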

        </div>

      </p>


    </body>

</html>