[llvm-bugs] [Bug 39672] New: vector ctlz should intelligently switch to scalar algorithm
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed Nov 14 22:16:35 PST 2018
https://bugs.llvm.org/show_bug.cgi?id=39672
Bug ID: 39672
Summary: vector ctlz should intelligently switch to scalar
algorithm
Product: new-bugs
Version: 5.0
Hardware: Macintosh
OS: MacOS X
Status: NEW
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: danielwatson311 at gmail.com
CC: htmldeveloper at gmail.com, llvm-bugs at lists.llvm.org
With few vector lanes and a large lane width (e.g. v2i64), LLVM's ctlz
intrinsic can actually be slower than a scalar implementation.

The following IR produces a complicated divide-and-conquer algorithm (with
`-O3 -mcpu=sandybridge`):
define <2 x i64> @foo(<2 x i64>) local_unnamed_addr #0 {
  %2 = call <2 x i64> @llvm.ctlz.v2i64(<2 x i64> %0, i1 true)
  ret <2 x i64> %2
}

declare <2 x i64> @llvm.ctlz.v2i64(<2 x i64>, i1)
This compiles to the following assembly:
.LCPI0_0:
        .zero   16,15
.LCPI0_1:
        .byte   4                       # 0x4
        .byte   3                       # 0x3
        .byte   2                       # 0x2
        .byte   2                       # 0x2
        .byte   1                       # 0x1
        .byte   1                       # 0x1
        .byte   1                       # 0x1
        .byte   1                       # 0x1
        .byte   0                       # 0x0
        .byte   0                       # 0x0
        .byte   0                       # 0x0
        .byte   0                       # 0x0
        .byte   0                       # 0x0
        .byte   0                       # 0x0
        .byte   0                       # 0x0
        .byte   0                       # 0x0
foo:                                    # @foo
        # Per-byte ctlz via a 4-bit pshufb lookup table:
        vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
        vpand   xmm2, xmm0, xmm1                   # low nibbles
        vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1] # xmm3 = [4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0]
        vpshufb xmm2, xmm3, xmm2                   # ctlz of each low nibble
        vpsrlw  xmm4, xmm0, 4
        vpand   xmm1, xmm4, xmm1                   # high nibbles
        vpxor   xmm4, xmm4, xmm4
        vpcmpeqb xmm5, xmm1, xmm4                  # mask: high nibble == 0
        vpand   xmm2, xmm2, xmm5
        vpshufb xmm1, xmm3, xmm1                   # ctlz of each high nibble
        vpaddb  xmm1, xmm2, xmm1                   # per-byte ctlz
        # Merge byte counts into word counts (add the low byte's count
        # only where the high byte was zero):
        vpcmpeqb xmm2, xmm0, xmm4
        vpsrlw  xmm2, xmm2, 8
        vpand   xmm2, xmm1, xmm2
        vpsrlw  xmm1, xmm1, 8
        vpaddw  xmm1, xmm1, xmm2
        # Same merge step, words -> dwords:
        vpcmpeqw xmm2, xmm0, xmm4
        vpsrld  xmm2, xmm2, 16
        vpand   xmm2, xmm1, xmm2
        vpsrld  xmm1, xmm1, 16
        vpaddd  xmm1, xmm1, xmm2
        # Same merge step, dwords -> qwords:
        vpcmpeqd xmm0, xmm0, xmm4
        vpsrlq  xmm0, xmm0, 32
        vpand   xmm0, xmm1, xmm0
        vpsrlq  xmm1, xmm1, 32
        vpaddq  xmm0, xmm1, xmm0
        ret
The scalar algorithm is merely something like (note this variant takes the
vector by pointer in rdi):
foo:                                    # @foo
        vmovdqa xmm0, xmmword ptr [rdi]
        vmovq   rax, xmm0
        vpextrq rcx, xmm0, 1
        bsr     rax, rax                # bsr + xor 63 == ctlz for nonzero input
        xor     rax, 63
        bsr     rcx, rcx
        xor     rcx, 63
        vmovq   xmm0, rcx
        vmovq   xmm1, rax
        vpunpcklqdq xmm0, xmm1, xmm0    # xmm0 = xmm1[0],xmm0[0]
        vmovdqa xmmword ptr [rdi], xmm0
        ret
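
At the IR level, what I'm asking for amounts to scalarizing the intrinsic
per element. A rough sketch of the equivalent IR (the function name and
exact sequence here are illustrative, not actual compiler output):

define <2 x i64> @foo_scalarized(<2 x i64>) local_unnamed_addr #0 {
  ; Extract each lane, run the scalar ctlz (bsr/lzcnt), reinsert.
  %e0 = extractelement <2 x i64> %0, i32 0
  %e1 = extractelement <2 x i64> %0, i32 1
  %c0 = call i64 @llvm.ctlz.i64(i64 %e0, i1 true)
  %c1 = call i64 @llvm.ctlz.i64(i64 %e1, i1 true)
  %r0 = insertelement <2 x i64> undef, i64 %c0, i32 0
  %r1 = insertelement <2 x i64> %r0, i64 %c1, i32 1
  ret <2 x i64> %r1
}

declare i64 @llvm.ctlz.i64(i64, i1)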
Compare: https://godbolt.org/z/ZdW3Yj

I have verified on my machine (sandybridge) that the scalar algorithm is
significantly faster (up to 2x) for v2i64 and v2i32, and similarly for v4i32
and v4i64.
- MacBook Pro (13-inch, Early 2011)
- macOS 10.13.6
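
For contrast (an assumption on my part, not something I measured): with many
narrow lanes the pshufb-based expansion presumably stays profitable, since
e.g. v16i8 needs only the per-nibble LUT step and none of the merge stages,
so the switch to scalar should probably be keyed on lane count and width:

define <16 x i8> @bar(<16 x i8>) local_unnamed_addr #0 {
  ; For i8 elements the vectorized LUT expansion is likely already optimal.
  %2 = call <16 x i8> @llvm.ctlz.v16i8(<16 x i8> %0, i1 true)
  ret <16 x i8> %2
}

declare <16 x i8> @llvm.ctlz.v16i8(<16 x i8>, i1)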