<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - vector ctlz should intelligently switch to scalar algorithm"

   href="https://bugs.llvm.org/show_bug.cgi?id=39672">39672</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>vector ctlz should intelligently switch to scalar algorithm

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>5.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>Macintosh

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>MacOS X

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>danielwatson311@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>With few vector lanes and a large lane-width (i.e. v2i64), LLVM's ctlz

intrinsics can actually be slower than a scalar implementation.

The following code produces a complicated divide-and-conquer algorithm (with

`-O3 -mcpu=sandybridge`):

define <2 x i64> @foo(<2 x i64>) local_unnamed_addr #0 {

    %2 = call <2 x i64> @llvm.ctlz.v2i64(<2 x i64> %0, i1 true)

    ret <2 x i64> %2

}

Which produces the following assembly:

.LCPI0_0:

  .zero 16,15

.LCPI0_1:

  .byte 4 # 0x4

  .byte 3 # 0x3

  .byte 2 # 0x2

  .byte 2 # 0x2

  .byte 1 # 0x1

  .byte 1 # 0x1

  .byte 1 # 0x1

  .byte 1 # 0x1

  .byte 0 # 0x0

  .byte 0 # 0x0

  .byte 0 # 0x0

  .byte 0 # 0x0

  .byte 0 # 0x0

  .byte 0 # 0x0

  .byte 0 # 0x0

  .byte 0 # 0x0

foo: # @foo

  vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 =

[15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]

  vpand xmm2, xmm0, xmm1

  vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1] # xmm3 =

[4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0]

  vpshufb xmm2, xmm3, xmm2

  vpsrlw xmm4, xmm0, 4

  vpand xmm1, xmm4, xmm1

  vpxor xmm4, xmm4, xmm4

  vpcmpeqb xmm5, xmm1, xmm4

  vpand xmm2, xmm2, xmm5

  vpshufb xmm1, xmm3, xmm1

  vpaddb xmm1, xmm2, xmm1

  vpcmpeqb xmm2, xmm0, xmm4

  vpsrlw xmm2, xmm2, 8

  vpand xmm2, xmm1, xmm2

  vpsrlw xmm1, xmm1, 8

  vpaddw xmm1, xmm1, xmm2

  vpcmpeqw xmm2, xmm0, xmm4

  vpsrld xmm2, xmm2, 16

  vpand xmm2, xmm1, xmm2

  vpsrld xmm1, xmm1, 16

  vpaddd xmm1, xmm1, xmm2

  vpcmpeqd xmm0, xmm0, xmm4

  vpsrlq xmm0, xmm0, 32

  vpand xmm0, xmm1, xmm0

  vpsrlq xmm1, xmm1, 32

  vpaddq xmm0, xmm1, xmm0

  ret

The scalar algorithm is merely something like:

foo: # @foo

  vmovdqa xmm0, xmmword ptr [rdi]

  vmovq rax, xmm0

  vpextrq rcx, xmm0, 1

  bsr rax, rax

  xor rax, 63

  bsr rcx, rcx

  xor rcx, 63

  vmovq xmm0, rcx

  vmovq xmm1, rax

  vpunpcklqdq xmm0, xmm1, xmm0 # xmm0 = xmm1[0],xmm0[0]

  vmovdqa xmmword ptr [rdi], xmm0

  ret

compare: <a href="https://godbolt.org/z/ZdW3Yj">https://godbolt.org/z/ZdW3Yj</a>

I have verified on my machine (sandybridge) that the scalar algorithm is

significantly faster (up to 2x) for v2i64, v2i32, and similar for v4i32, v4i64.

- MacBook Pro (13-inch, Early 2011)

- macOS 10.13.6</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>