[llvm-bugs] [Bug 41545] New: Byte swap idiom loses optimization on AVX+
via llvm-bugs
llvm-bugs at lists.llvm.org
Fri Apr 19 22:27:14 PDT 2019
https://bugs.llvm.org/show_bug.cgi?id=41545
Bug ID: 41545
Summary: Byte swap idiom loses optimization on AVX+
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: jed at 59a2.org
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, spatel+llvm at rotateright.com
This code correctly optimizes to a simple load on x86-64:
static unsigned read_u32_le(const unsigned char arr[]) {
  return (arr[0] << 0)
       | (arr[1] << 8)
       | (arr[2] << 16)
       | (arr[3] << 24);
}
clang -O:
read_u32_le: # @read_u32_le
mov eax, dword ptr [rdi]
ret
However, when allowed to inline into code such as
unsigned sum_buf(int len, const unsigned char *arr) {
  unsigned sum = 0;
  for (int i = 0; i < len; i += 4) {
    sum += read_u32_le(arr + i);
  }
  return sum;
}
on AVX/AVX2, the optimization is lost. For example, with -march=haswell:
.LBB0_5: # =>This Inner Loop Header: Depth=1
vmovdqu xmm7, xmmword ptr [rsi + 4*rax]
vmovdqu xmm0, xmmword ptr [rsi + 4*rax + 16]
vmovdqu xmm1, xmmword ptr [rsi + 4*rax + 32]
vmovdqu xmm2, xmmword ptr [rsi + 4*rax + 48]
vpblendw xmm11, xmm7, xmm8, 170 # xmm11 = xmm7[0],xmm8[1],xmm7[2],xmm8[3],xmm7[4],xmm8[5],xmm7[6],xmm8[7]
vpblendw xmm12, xmm0, xmm8, 170 # xmm12 = xmm0[0],xmm8[1],xmm0[2],xmm8[3],xmm0[4],xmm8[5],xmm0[6],xmm8[7]
vpblendw xmm13, xmm1, xmm8, 170 # xmm13 = xmm1[0],xmm8[1],xmm1[2],xmm8[3],xmm1[4],xmm8[5],xmm1[6],xmm8[7]
vpblendw xmm14, xmm2, xmm8, 170 # xmm14 = xmm2[0],xmm8[1],xmm2[2],xmm8[3],xmm2[4],xmm8[5],xmm2[6],xmm8[7]
vpand xmm3, xmm7, xmm9
vpor xmm11, xmm11, xmm3
vpand xmm3, xmm0, xmm9
vpor xmm12, xmm12, xmm3
vpand xmm3, xmm1, xmm9
vpor xmm13, xmm13, xmm3
vpand xmm3, xmm2, xmm9
vpor xmm3, xmm14, xmm3
vpand xmm7, xmm7, xmm10
vpor xmm7, xmm11, xmm7
vpaddd xmm15, xmm7, xmm15
vpand xmm0, xmm0, xmm10
vpor xmm0, xmm12, xmm0
vpaddd xmm4, xmm0, xmm4
vpand xmm0, xmm1, xmm10
vpor xmm0, xmm13, xmm0
vpaddd xmm5, xmm0, xmm5
vpand xmm0, xmm2, xmm10
vpor xmm0, xmm3, xmm0
vpaddd xmm6, xmm0, xmm6
add rax, 16
cmp rdi, rax
jne .LBB0_5
Meanwhile, the same inner loop for the generic x86-64 target:
.LBB0_5: # =>This Inner Loop Header: Depth=1
movdqu xmm2, xmmword ptr [rsi + 4*rax]
paddd xmm0, xmm2
movdqu xmm2, xmmword ptr [rsi + 4*rax + 16]
paddd xmm1, xmm2
add rax, 8
cmp rdi, rax
jne .LBB0_5
https://gcc.godbolt.org/z/Mop93_
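For reference, a memcpy-based spelling of the read (a sketch only; this
function and its name are not part of the reduced test case above, and it
assumes a little-endian target) expresses the 32-bit load directly, so no
shift/or recombination has to be re-matched after inlining and the
vectorizer should see plain i32 loads, as in the generic x86-64 output:

#include <string.h>

/* Sketch of an alternative spelling: the 4-byte memcpy is lowered to a
   single unaligned 32-bit load, so the vectorized sum_buf loop should
   stay at plain vector loads and adds. On x86-64 the bytes are already
   in little-endian order, so no swap is needed. */
static unsigned read_u32_le_memcpy(const unsigned char arr[]) {
  unsigned v;
  memcpy(&v, arr, sizeof v);
  return v;
}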