<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - vector ctlz should intelligently switch to scalar algorithm"
href="https://bugs.llvm.org/show_bug.cgi?id=39672">39672</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>vector ctlz should intelligently switch to scalar algorithm
</td>
</tr>
<tr>
<th>Product</th>
<td>new-bugs
</td>
</tr>
<tr>
<th>Version</th>
<td>5.0
</td>
</tr>
<tr>
<th>Hardware</th>
<td>Macintosh
</td>
</tr>
<tr>
<th>OS</th>
<td>MacOS X
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>new bugs
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>danielwatson311@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>With few vector lanes and a large lane-width (i.e. v2i64), LLVM's ctlz
intrinsics can actually be slower than a scalar implementation.
The following code produces a complicated divide-and-conquer algorithm (with
`-O3 -mcpu=sandybridge`):
define <2 x i64> @foo(<2 x i64>) local_unnamed_addr #0 {
%2 = call <2 x i64> @llvm.ctlz.v2i64(<2 x i64> %0, i1 true)
ret <2 x i64> %2
}

This produces the following assembly:

.LCPI0_0:
.zero 16,15
.LCPI0_1:
.byte 4 # 0x4
.byte 3 # 0x3
.byte 2 # 0x2
.byte 2 # 0x2
.byte 1 # 0x1
.byte 1 # 0x1
.byte 1 # 0x1
.byte 1 # 0x1
.byte 0 # 0x0
.byte 0 # 0x0
.byte 0 # 0x0
.byte 0 # 0x0
.byte 0 # 0x0
.byte 0 # 0x0
.byte 0 # 0x0
.byte 0 # 0x0
foo: # @foo
vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 =
[15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
vpand xmm2, xmm0, xmm1
vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1] # xmm3 =
[4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0]
vpshufb xmm2, xmm3, xmm2
vpsrlw xmm4, xmm0, 4
vpand xmm1, xmm4, xmm1
vpxor xmm4, xmm4, xmm4
vpcmpeqb xmm5, xmm1, xmm4
vpand xmm2, xmm2, xmm5
vpshufb xmm1, xmm3, xmm1
vpaddb xmm1, xmm2, xmm1
vpcmpeqb xmm2, xmm0, xmm4
vpsrlw xmm2, xmm2, 8
vpand xmm2, xmm1, xmm2
vpsrlw xmm1, xmm1, 8
vpaddw xmm1, xmm1, xmm2
vpcmpeqw xmm2, xmm0, xmm4
vpsrld xmm2, xmm2, 16
vpand xmm2, xmm1, xmm2
vpsrld xmm1, xmm1, 16
vpaddd xmm1, xmm1, xmm2
vpcmpeqd xmm0, xmm0, xmm4
vpsrlq xmm0, xmm0, 32
vpand xmm0, xmm1, xmm0
vpsrlq xmm1, xmm1, 32
vpaddq xmm0, xmm1, xmm0
ret
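For clarity, the divide-and-conquer scheme above can be expressed per element in C: a 16-entry nibble table (mirroring .LCPI0_1), then successive doubling from 8- to 64-bit. This is only an illustrative sketch (the helper names are hypothetical), and unlike the `i1 true` zero-is-undef flag in the IR it gives a defined result of 64 for an input of 0:

```c
#include <stdint.h>

/* Leading-zero count of a 4-bit value, within 4 bits (cf. .LCPI0_1). */
static const uint8_t nib_clz[16] = {4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0};

/* At each step: if the high half is nonzero, the answer lies there;
   otherwise it is the high half's width plus the low half's count. */
static unsigned clz8(uint8_t b)   { uint8_t  h = b >> 4;  return h ? nib_clz[h] : 4  + nib_clz[b & 0xF]; }
static unsigned clz16(uint16_t x) { uint8_t  h = x >> 8;  return h ? clz8(h)    : 8  + clz8((uint8_t)x); }
static unsigned clz32(uint32_t x) { uint16_t h = x >> 16; return h ? clz16(h)   : 16 + clz16((uint16_t)x); }
static unsigned clz64(uint64_t x) { uint32_t h = x >> 32; return h ? clz32(h)   : 32 + clz32((uint32_t)x); }
```

The vector code performs the same merges in parallel across lanes, using vpcmpeqb/w/d masks in place of the branches.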

The scalar algorithm is merely something like:

foo: # @foo
vmovdqa xmm0, xmmword ptr [rdi]
vmovq rax, xmm0
vpextrq rcx, xmm0, 1
bsr rax, rax
xor rax, 63
bsr rcx, rcx
xor rcx, 63
vmovq xmm0, rcx
vmovq xmm1, rax
vpunpcklqdq xmm0, xmm1, xmm0 # xmm0 = xmm1[0],xmm0[0]
vmovdqa xmmword ptr [rdi], xmm0
ret

Compare: <a href="https://godbolt.org/z/ZdW3Yj">https://godbolt.org/z/ZdW3Yj</a>

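The scalar approach can be sketched in C as a per-lane loop over a count-leading-zeros builtin, which the backend can lower to bsr/lzcnt as in the assembly above. A minimal sketch; the function name is hypothetical, and `__builtin_clzll(0)` is undefined, matching the `i1 true` (zero-is-undef) argument in the IR:

```c
#include <stdint.h>

/* Hypothetical scalar fallback for @llvm.ctlz.v2i64: extract each
   64-bit lane and count leading zeros with a scalar instruction. */
static void ctlz_v2i64_scalar(const uint64_t in[2], uint64_t out[2]) {
    for (int i = 0; i < 2; i++) {
        /* Undefined for in[i] == 0, like ctlz with i1 true. */
        out[i] = (uint64_t)__builtin_clzll(in[i]);
    }
}
```

On x86-64 this is two extracts, two bsr/xor (or lzcnt) pairs, and a repack.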
I have verified on my machine (Sandy Bridge) that the scalar algorithm is
significantly faster (up to 2x) for v2i64 and v2i32, with similar results for
v4i32 and v4i64.

- MacBook Pro (13-inch, Early 2011)
- macOS 10.13.6</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>