<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Optimize bit-scatter operation"

   href="https://bugs.llvm.org/show_bug.cgi?id=37796">37796</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Optimize bit-scatter operation

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>C++

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>ruiu@google.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>dgregor@apple.com, llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>I found that clang can't optimize the following code:

  // This function scatters Val's bits as instructed by Mask.

  // Here is an example:

  //

  //  Val:    abcd efgh ijkl mnop

  //  Mask:   1110 0001 1111 0001

  //  Result: hij0 000k lmno 000p

  //

  // Some CPUs support this operation as a single instruction.

  // For example, the Intel BMI2 extension provides it as PDEP.

  static inline uint32_t scatter(uint32_t Val, uint32_t Mask) {

    uint32_t Res = 0;

    uint32_t Off = 0;

    for (uint32_t I = 0; I < 32; ++I)

      if (Mask & (1 << I))

        Res |= !!(Val & (1 << Off++)) << I;

    return Res;

  }

  uint32_t foo(uint32_t x) {

    return scatter(x, 1);

  }

It can be compiled to just `andl $1, %edi` on x86-64, but currently clang

compiles this to a loop that iterates 32 times (<a href="https://godbolt.org/g/jX5sNW">https://godbolt.org/g/jX5sNW</a>).

If I add "#pragma unroll", clang can optimize the code

(<a href="https://godbolt.org/g/Apx7Nj">https://godbolt.org/g/Apx7Nj</a>)</pre>
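        <pre>A minimal sketch of why the single-instruction result is expected, assuming
C++14 constexpr and a 32-bit unsigned (as on x86-64). scatter_ref is a
hypothetical name for a constant-evaluable variant of scatter() above; it is
not code from clang or from the linked examples. The checks run at compile
time and cover the Val/Mask example in the comment and the Mask == 1 case.

```cpp
// Hypothetical constant-evaluable variant of scatter(); assumes unsigned is
// 32 bits wide. Unsigned literals avoid shifting into the sign bit of an int.
constexpr unsigned scatter_ref(unsigned val, unsigned mask) {
  unsigned res = 0;
  unsigned off = 0;  // index of the next unused low bit of val
  for (unsigned i = 0; i < 32; ++i)
    if (mask & (1u << i)) {
      // Move the next unused low bit of val to position i.
      res |= ((val >> off) & 1u) << i;
      ++off;
    }
  return res;
}

// Mask 1110 0001 1111 0001 has nine set bits, so the low nine bits of val
// fill exactly the mask positions.
static_assert(scatter_ref(0x01FFu, 0xE1F1u) == 0xE1F1u, "low nine bits fill the mask");
// Bit 1 of val ('o' in the example) lands on bit 4 of the result.
static_assert(scatter_ref(0x0002u, 0xE1F1u) == 0x0010u, "'o' moves to bit 4");
// With Mask == 1 only bit 0 of val survives, i.e. the result is val bitwise-and 1.
static_assert(scatter_ref(0xFFFFFFFFu, 1u) == 1u, "odd input gives 1");
static_assert(scatter_ref(0xFFFFFFFEu, 1u) == 0u, "even input gives 0");
```

Since every check passes during constant evaluation, the Mask == 1 case
depends only on the lowest bit of the input.</pre>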

        </div>

      </p>


    </body>

</html>