<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - Optimize bit-scatter operation"
   href="https://bugs.llvm.org/show_bug.cgi?id=37796">37796</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Optimize bit-scatter operation
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>clang
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>C++
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedclangbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>ruiu@google.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>dgregor@apple.com, llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>I found that clang can't optimize the following code:

  // This function scatters the low-order bits of Val into the bit
  // positions that are set in Mask.
  // Here is an example:
  //
  //  Val:    abcd efgh ijkl mnop
  //  Mask:   1110 0001 1111 0001
  //  Result: hij0 000k lmno 000p
  //
  // Some CPUs support this operation as a single instruction.
  // For example, the Intel BMI2 extension provides it as PDEP.
  static inline uint32_t scatter(uint32_t Val, uint32_t Mask) {
    uint32_t Res = 0;
    uint32_t Off = 0;

    for (uint32_t I = 0; I < 32; ++I)
      if (Mask & (1 << I))
        Res |= !!(Val & (1 << Off++)) << I;
    return Res;
  }

  uint32_t foo(uint32_t x) {
    return scatter(x, 1);
  }

This can be compiled to just `andl $1, %edi` on x86-64, but clang currently
compiles it to a loop that iterates 32 times (<a href="https://godbolt.org/g/jX5sNW">https://godbolt.org/g/jX5sNW</a>).
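
The reduction can be checked at compile time. With Mask == 1, only the
iteration I == 0 passes the mask test, so the loop collapses to Val & 1.
The following is a minimal sketch, not the code from the linked session:
the constexpr qualifier, the u32 alias (a header-free stand-in for
uint32_t), the explicit 1u literals, and the static_assert checks are all
assumptions added for illustration.

```cpp
// u32 is a hypothetical stand-in for uint32_t so the sketch needs no headers.
typedef unsigned int u32;
static_assert(sizeof(u32) * 8 == 32, "sketch assumes a 32-bit unsigned int");

// Same loop as in the report, marked constexpr (C++14 or later) and using
// unsigned literals so that every shift stays in unsigned arithmetic.
constexpr u32 scatter(u32 Val, u32 Mask) {
  u32 Res = 0;
  u32 Off = 0;

  for (u32 I = 0; I < 32; ++I)
    if (Mask & (1u << I))
      Res |= u32(!!(Val & (1u << Off++))) << I;
  return Res;
}

constexpr u32 foo(u32 x) {
  return scatter(x, 1);
}

// With Mask == 1 the result is just the lowest bit of the argument.
static_assert(foo(0u) == 0u, "bit 0 clear");
static_assert(foo(0xFFFFFFFFu) == 1u, "bit 0 set");
static_assert(foo(0xFFFFFFFEu) == 0u, "only bit 0 matters");

// The mask from the comment above (1110 0001 1111 0001 == 0xE1F1) has nine
// set bits, so the nine low bits of Val fill exactly those positions.
static_assert(scatter(0x1FFu, 0xE1F1u) == 0xE1F1u, "example mask");
```

Since the front end can already fold these calls, the missing piece appears
to be in the optimizer's handling of the non-unrolled loop.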

If I add "#pragma unroll" before the loop, clang does optimize the code
(<a href="https://godbolt.org/g/Apx7Nj">https://godbolt.org/g/Apx7Nj</a>).</pre>
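<pre>For reference, a sketch of where the pragma presumably goes. This is an
assumption about the workaround, not a copy of the linked session: the
names scatter_unrolled and foo_unrolled and the u32 alias (a header-free
stand-in for uint32_t) are hypothetical, and the 1u literals are added so
that every shift stays in unsigned arithmetic.

```cpp
// u32 is a hypothetical stand-in for uint32_t so the sketch needs no headers.
typedef unsigned int u32;
static_assert(sizeof(u32) * 8 == 32, "sketch assumes a 32-bit unsigned int");

// Hypothetical name; the function in the report is called scatter.
static inline u32 scatter_unrolled(u32 Val, u32 Mask) {
  u32 Res = 0;
  u32 Off = 0;

  // Clang loop hint: ask for the loop that follows to be fully unrolled.
  // It must come directly before the loop statement.
  #pragma unroll
  for (u32 I = 0; I < 32; ++I)
    if (Mask & (1u << I))
      Res |= u32(!!(Val & (1u << Off++))) << I;
  return Res;
}

u32 foo_unrolled(u32 x) {
  return scatter_unrolled(x, 1);
}
```

Compilers that do not recognize this pragma are expected to ignore it
(possibly with a warning), so the function's result does not depend on it.</pre>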
        </div>
      </p>


    </body>
</html>