[llvm-bugs] [Bug 39672] New: vector ctlz should intelligently switch to scalar algorithm

via llvm-bugs llvm-bugs at lists.llvm.org
Wed Nov 14 22:16:35 PST 2018


            Bug ID: 39672
           Summary: vector ctlz should intelligently switch to scalar
                    algorithm
           Product: new-bugs
           Version: 5.0
          Hardware: Macintosh
                OS: MacOS X
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: danielwatson311 at gmail.com
                CC: htmldeveloper at gmail.com, llvm-bugs at lists.llvm.org

With few vector lanes and a large lane width (e.g. v2i64), LLVM's ctlz
intrinsics can actually be slower than a scalar implementation.

The following IR compiles to a complicated divide-and-conquer algorithm
(with `-O3 -mcpu=sandybridge`):

define <2 x i64> @foo(<2 x i64>) local_unnamed_addr #0 {
    %2 = call <2 x i64> @llvm.ctlz.v2i64(<2 x i64> %0, i1 true)
    ret <2 x i64> %2
}

declare <2 x i64> @llvm.ctlz.v2i64(<2 x i64>, i1)
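
(Something like `llc -O3 -mcpu=sandybridge foo.ll -o -` should reproduce
this; the Godbolt link below has the exact setup.)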

This produces the following assembly:

.LCPI0_0:
  .zero 16,15
.LCPI0_1:
  .byte 4 # 0x4
  .byte 3 # 0x3
  .byte 2 # 0x2
  .byte 2 # 0x2
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 1 # 0x1
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
  .byte 0 # 0x0
foo: # @foo
  vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
  vpand xmm2, xmm0, xmm1
  vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1] # xmm3 = [4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0]
  vpshufb xmm2, xmm3, xmm2
  vpsrlw xmm4, xmm0, 4
  vpand xmm1, xmm4, xmm1
  vpxor xmm4, xmm4, xmm4
  vpcmpeqb xmm5, xmm1, xmm4
  vpand xmm2, xmm2, xmm5
  vpshufb xmm1, xmm3, xmm1
  vpaddb xmm1, xmm2, xmm1
  vpcmpeqb xmm2, xmm0, xmm4
  vpsrlw xmm2, xmm2, 8
  vpand xmm2, xmm1, xmm2
  vpsrlw xmm1, xmm1, 8
  vpaddw xmm1, xmm1, xmm2
  vpcmpeqw xmm2, xmm0, xmm4
  vpsrld xmm2, xmm2, 16
  vpand xmm2, xmm1, xmm2
  vpsrld xmm1, xmm1, 16
  vpaddd xmm1, xmm1, xmm2
  vpcmpeqd xmm0, xmm0, xmm4
  vpsrlq xmm0, xmm0, 32
  vpand xmm0, xmm1, xmm0
  vpsrlq xmm1, xmm1, 32
  vpaddq xmm0, xmm1, xmm0
  ret
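
To make the sequence easier to follow, here is a rough scalar C model
(my reconstruction, not code taken from LLVM) of what one 64-bit lane of
the assembly above computes: a per-byte clz via the nibble LUT, then
three merge rounds that widen the counts 8 -> 16 -> 32 -> 64 bits.

#include <stdint.h>

/* Nibble -> clz table; the same 16 bytes as .LCPI0_1 above. */
static const uint8_t nib_clz[16] = {4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0};

/* clz of a 2w-bit value from its w-bit halves: if the high half is
   zero its count is already w, so adding the low half's count gives
   the total; otherwise the high half's count is the answer. Each
   pcmpeq/psrl/pand/padd round above computes exactly this. */
#define MERGE(hi_val, hi_cnt, lo_cnt) \
  ((hi_val) == 0 ? (hi_cnt) + (lo_cnt) : (hi_cnt))

static unsigned clz64_model(uint64_t x) {
  unsigned c8[8], c16[4], c32[2];
  for (int i = 0; i < 8; i++) {            /* pshufb/pcmpeqb/paddb  */
    uint8_t b = (uint8_t)(x >> (8 * i));   /* byte 0 = least signif */
    c8[i] = (b >> 4) ? nib_clz[b >> 4] : 4 + nib_clz[b & 0xF];
  }
  for (int i = 0; i < 4; i++)              /* pcmpeqb/psrlw/paddw   */
    c16[i] = MERGE((uint8_t)(x >> (16*i + 8)), c8[2*i+1], c8[2*i]);
  for (int i = 0; i < 2; i++)              /* pcmpeqw/psrld/paddd   */
    c32[i] = MERGE((uint16_t)(x >> (32*i + 16)), c16[2*i+1], c16[2*i]);
  return MERGE((uint32_t)(x >> 32), c32[1], c32[0]);
}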

By contrast, the scalar algorithm is roughly:

foo: # @foo
  vmovdqa xmm0, xmmword ptr [rdi]
  vmovq rax, xmm0
  vpextrq rcx, xmm0, 1
  bsr rax, rax
  xor rax, 63
  bsr rcx, rcx
  xor rcx, 63
  vmovq xmm0, rcx
  vmovq xmm1, rax
  vpunpcklqdq xmm0, xmm1, xmm0 # xmm0 = xmm1[0],xmm0[0]
  vmovdqa xmmword ptr [rdi], xmm0
  ret
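
That lowering corresponds roughly to the following C sketch (the
function name and intrinsic choices here are illustrative, not from
LLVM):

#include <immintrin.h>
#include <stdint.h>

/* Extract both lanes, take scalar clz (bsr+xor on Sandy Bridge, which
   lacks lzcnt), and repack. __builtin_clzll is undefined for 0, which
   matches the 'i1 true' (zero is poison) in the intrinsic call above. */
__m128i ctlz_v2i64_scalar(__m128i v) {
  uint64_t a = (uint64_t)_mm_cvtsi128_si64(v);     /* vmovq   */
  uint64_t b = (uint64_t)_mm_extract_epi64(v, 1);  /* vpextrq */
  return _mm_set_epi64x(__builtin_clzll(b),        /* high lane */
                        __builtin_clzll(a));       /* low lane  */
}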

Compare: https://godbolt.org/z/ZdW3Yj

I have verified on my machine (Sandy Bridge) that the scalar algorithm is
significantly faster (up to 2x) for v2i64 and v2i32, with similar results
for v4i32 and v4i64.
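
The timing can be done with a simple harness along these lines (a
hypothetical sketch, not the actual harness used; 'foo' is the
IR-compiled vector version, 'ctlz_v2i64_scalar' the sketch above):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

__m128i foo(__m128i);                /* vector lowering (IR above)   */
__m128i ctlz_v2i64_scalar(__m128i);  /* scalar sketch from above     */

/* Inputs should be nonzero, since both versions assume 'i1 true'.
   The indirect call also keeps either version from being inlined. */
static double bench(__m128i (*f)(__m128i), const uint64_t *d, size_t n) {
  __m128i acc = _mm_setzero_si128();
  clock_t t0 = clock();
  for (size_t i = 0; i + 2 <= n; i += 2) {
    __m128i v = _mm_loadu_si128((const __m128i *)&d[i]);
    acc = _mm_add_epi64(acc, f(v));  /* dependency keeps work live   */
  }
  double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
  /* print the accumulator so the loop cannot be optimized away */
  printf("checksum: %lld\n", (long long)_mm_cvtsi128_si64(acc));
  return secs;
}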

- MacBook Pro (13-inch, Early 2011)
- macOS 10.13.6
