[llvm-bugs] [Bug 46695] New: Poor code gen when many loops increment the same vectors
via llvm-bugs
llvm-bugs at lists.llvm.org
Sun Jul 12 12:12:50 PDT 2020
https://bugs.llvm.org/show_bug.cgi?id=46695
Bug ID: 46695
Summary: Poor code gen when many loops increment the same
vectors
Product: libraries
Version: 10.0
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: elrodc at gmail.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, spatel+llvm at rotateright.com
Created attachment 23722
--> https://bugs.llvm.org/attachment.cgi?id=23722&action=edit
Dumped LLVM module producing the example (same code as in the godbolt link)
Not-that-minimal-example: https://godbolt.org/z/x8zMq4
This code contains nested loops that accumulate into eight vector registers:
zmm0, zmm1, zmm2, zmm5, zmm6, zmm7, zmm10, and zmm11. These accumulators are
ultimately reduced to a scalar:
vaddpd zmm1, zmm6, zmm1
vaddpd zmm3, zmm7, zmm11
vaddpd zmm1, zmm3, zmm1
vaddpd zmm3, zmm5, zmm10
vaddpd zmm0, zmm0, zmm2
vaddpd zmm0, zmm0, zmm3
vaddpd zmm0, zmm0, zmm1
vextractf64x4 ymm1, zmm0, 1
vaddpd zmm0, zmm0, zmm1
vextractf128 xmm1, ymm0, 1
vaddpd xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddsd xmm0, xmm0, xmm1
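For reference, a rough C-intrinsics rendering of that reduction tree (a
sketch, not the actual source; the accumulators are passed as an array here)
looks like:

#include <immintrin.h>

/* Pairwise-add the eight accumulators, then fold one zmm down to a
   scalar, mirroring the vextractf64x4 / vextractf128 / vpermilpd
   sequence above. */
static double reduce8(__m512d acc[8]) {
    __m512d a = _mm512_add_pd(_mm512_add_pd(acc[0], acc[1]),
                              _mm512_add_pd(acc[2], acc[3]));
    __m512d b = _mm512_add_pd(_mm512_add_pd(acc[4], acc[5]),
                              _mm512_add_pd(acc[6], acc[7]));
    __m512d s = _mm512_add_pd(a, b);
    __m256d v4 = _mm256_add_pd(_mm512_castpd512_pd256(s),
                               _mm512_extractf64x4_pd(s, 1)); /* 512 -> 256 */
    __m128d v2 = _mm_add_pd(_mm256_castpd256_pd128(v4),
                            _mm256_extractf128_pd(v4, 1));    /* 256 -> 128 */
    __m128d v1 = _mm_add_sd(v2, _mm_permute_pd(v2, 1));       /* 128 -> 64  */
    return _mm_cvtsd_f64(v1);
}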
These nested loops are organized like:

for i in I1
    for j in J1
        for k in K
            ; vfmadd to accumulate
        end
    end
    for j in J2
        for k in K
            ; vfmadd to accumulate
        end
    end
    for j in J3
        ...
    end
end
for i in I2
    ...
In this example, the nesting depth is 3; I did not observe the problem when
the depth was 2.
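For concreteness, here is a minimal C-intrinsics sketch of this shape (the
names, sizes, and indexing are hypothetical; only the structure matters).
Every j-loop accumulates into the same accumulators, and the real code keeps
eight of them, abbreviated to two here:

#include <immintrin.h>

void kernel(const double *a, const double *x, double *out,
            long I, long J1, long J2, long K) {
    __m512d acc0 = _mm512_setzero_pd();
    __m512d acc1 = _mm512_setzero_pd();
    for (long i = 0; i < I; ++i) {
        for (long j = 0; j < J1; ++j)
            for (long k = 0; k < K; ++k) {
                __m512d s = _mm512_set1_pd(a[j * K + k]);   /* vbroadcastsd */
                acc0 = _mm512_fmadd_pd(s, _mm512_loadu_pd(x + 16 * k),     acc0);
                acc1 = _mm512_fmadd_pd(s, _mm512_loadu_pd(x + 16 * k + 8), acc1);
            }
        for (long j = 0; j < J2; ++j)
            for (long k = 0; k < K; ++k) {
                __m512d s = _mm512_set1_pd(a[j * K + k]);
                acc0 = _mm512_fmadd_pd(s, _mm512_loadu_pd(x + 16 * k),     acc0);
                acc1 = _mm512_fmadd_pd(s, _mm512_loadu_pd(x + 16 * k + 8), acc1);
            }
        /* ...more j-loops of the same shape... */
    }
    _mm512_storeu_pd(out,     acc0);
    _mm512_storeu_pd(out + 8, acc1);
}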
The problem is that for each of these sets of innermost loops, a different
set of registers is chosen as the accumulation registers.
Given how many loops there are, this ends up requiring a huge number of
registers and a lot of stack space, as well as a large number of move
instructions to keep them all in correspondence. Here is an example
innermost loop:
.LBB0_36: # %L2043
vbroadcastsd zmm0, qword ptr [rax + 8*rbp + 8]
vbroadcastsd zmm1, qword ptr [r13 + 8*rbp + 8]
vfmadd231pd zmm14, zmm0, zmmword ptr [rsp + 1536] # 64-byte Folded Reload
vfmadd231pd zmm15, zmm1, zmmword ptr [rsp + 1472] # 64-byte Folded Reload
vfmadd231pd zmm12, zmm0, zmmword ptr [rsp + 1856] # 64-byte Folded Reload
vfmadd231pd zmm13, zmm1, zmmword ptr [rsp + 1792] # 64-byte Folded Reload
vfmadd231pd zmm9, zmm0, zmmword ptr [rsp + 1408] # 64-byte Folded Reload
vfmadd231pd zmm8, zmm1, zmmword ptr [rsp + 3008] # 64-byte Folded Reload
vfmadd231pd zmm4, zmm0, zmmword ptr [rsp + 2944] # 64-byte Folded Reload
vfmadd231pd zmm3, zmm1, zmmword ptr [rsp + 1728] # 64-byte Folded Reload
vfmadd231pd zmm14, zmm0, zmmword ptr [rsp + 1664] # 64-byte Folded Reload
vfmadd231pd zmm15, zmm1, zmmword ptr [rsp + 3136] # 64-byte Folded Reload
vfmadd231pd zmm12, zmm0, zmmword ptr [rsp + 3072] # 64-byte Folded Reload
vfmadd231pd zmm13, zmm1, zmmword ptr [rsp + 1600] # 64-byte Folded Reload
vfmadd231pd zmm9, zmm0, zmmword ptr [rsp + 3264] # 64-byte Folded Reload
vfmadd231pd zmm8, zmm1, zmmword ptr [rsp + 3200] # 64-byte Folded Reload
vfmadd231pd zmm4 {k1}, zmm0, zmmword ptr [rsp + 3392] # 64-byte Folded Reload
vfmadd231pd zmm3 {k1}, zmm1, zmmword ptr [rsp + 3328] # 64-byte Folded Reload
inc rbp
vmovapd zmm19, zmm14
vmovapd zmm18, zmm14
vmovapd zmm17, zmm14
vmovupd zmmword ptr [rsp + 704], zmm14 # 64-byte Spill
vmovupd zmmword ptr [rsp + 576], zmm14 # 64-byte Spill
vmovupd zmmword ptr [rsp + 448], zmm14 # 64-byte Spill
vmovupd zmmword ptr [rsp + 320], zmm14 # 64-byte Spill
vmovupd zmmword ptr [rsp + 192], zmm14 # 64-byte Spill
vmovupd zmmword ptr [rsp + 128], zmm14 # 64-byte Spill
vmovapd zmm0, zmm14
vmovapd zmm29, zmm15
vmovapd zmm27, zmm15
vmovapd zmm26, zmm15
vmovupd zmmword ptr [rsp + 640], zmm15 # 64-byte Spill
vmovupd zmmword ptr [rsp + 512], zmm15 # 64-byte Spill
vmovupd zmmword ptr [rsp + 384], zmm15 # 64-byte Spill
vmovupd zmmword ptr [rsp + 256], zmm15 # 64-byte Spill
vmovupd zmmword ptr [rsp + 64], zmm15 # 64-byte Spill
vmovupd zmmword ptr [rsp], zmm15 # 64-byte Spill
vmovapd zmm2, zmm15
vmovapd zmm30, zmm12
vmovapd zmm28, zmm12
vmovapd zmm31, zmm12
vmovupd zmmword ptr [rsp + 1344], zmm12 # 64-byte Spill
vmovupd zmmword ptr [rsp + 1216], zmm12 # 64-byte Spill
vmovupd zmmword ptr [rsp + 1088], zmm12 # 64-byte Spill
vmovupd zmmword ptr [rsp + 960], zmm12 # 64-byte Spill
vmovupd zmmword ptr [rsp + 896], zmm12 # 64-byte Spill
vmovapd zmm5, zmm12
vmovapd zmm22, zmm13
vmovapd zmm24, zmm13
vmovapd zmm20, zmm13
vmovupd zmmword ptr [rsp + 1280], zmm13 # 64-byte Spill
vmovupd zmmword ptr [rsp + 1152], zmm13 # 64-byte Spill
vmovupd zmmword ptr [rsp + 1024], zmm13 # 64-byte Spill
vmovupd zmmword ptr [rsp + 832], zmm13 # 64-byte Spill
vmovupd zmmword ptr [rsp + 768], zmm13 # 64-byte Spill
vmovapd zmm10, zmm13
vmovapd zmm23, zmm9
vmovapd zmm16, zmm9
vmovupd zmmword ptr [rsp + 2496], zmm9 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2368], zmm9 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2240], zmm9 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2112], zmm9 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2048], zmm9 # 64-byte Spill
vmovapd zmm7, zmm9
vmovapd zmm21, zmm8
vmovapd zmm25, zmm8
vmovupd zmmword ptr [rsp + 2432], zmm8 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2304], zmm8 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2176], zmm8 # 64-byte Spill
vmovupd zmmword ptr [rsp + 1984], zmm8 # 64-byte Spill
vmovupd zmmword ptr [rsp + 1920], zmm8 # 64-byte Spill
vmovapd zmm11, zmm8
vmovupd zmmword ptr [rsp + 2880], zmm4 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2816], zmm4 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2752], zmm4 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2688], zmm4 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2624], zmm4 # 64-byte Spill
vmovupd zmmword ptr [rsp + 2560], zmm4 # 64-byte Spill
vmovapd zmm6, zmm4
vmovapd zmm1, zmm3
cmp rbp, rbx
jl .LBB0_36
Every single vmovupd and vmovapd above is unnecessary and should not exist,
and the `vfmadd231pd`s should not be loading from the stack.
The loop should simply be incrementing `zmm` registers 0, 1, 2, 5, 6, 7, 10,
11 in place. Instead, it loads from the stack, accumulates into `zmm`
registers 14, 3, 15, 12, 4, 9, 13, 8, and then `vmov(a/u)pd`s the results to
a huge number of aliasing stack slots and registers, including of course the
`zmm` registers they should have been in all along: 0, 1, 2, 5, 6, 7, 10, 11.
The inner loop should consist of nothing but the `vbroadcastsd`s, the
`vfmadd231pd`s, and the `inc`, `cmp`, `jl`.
This generated code is over 6 times slower than a version that creates
(zero-initializes) fresh accumulation vectors for each inner loop and then
adds them to the final accumulation vectors. That version should be strictly
slower, since it does extra additions, but it works around this
performance-killing bug.
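A minimal sketch of that workaround, using the same hypothetical names and
indexing as the sketch above:

#include <immintrin.h>

/* Each j-iteration's k-loop gets FRESH zero-initialized accumulators,
   which are folded into the persistent ones afterwards. This nominally
   adds one vaddpd per accumulator per j-iteration, yet it avoids the
   spill/reload storm and runs much faster in practice. */
static void workaround(const double *a, const double *x, long J, long K,
                       __m512d *acc0, __m512d *acc1) {
    for (long j = 0; j < J; ++j) {
        __m512d t0 = _mm512_setzero_pd();
        __m512d t1 = _mm512_setzero_pd();
        for (long k = 0; k < K; ++k) {
            __m512d s = _mm512_set1_pd(a[j * K + k]);
            t0 = _mm512_fmadd_pd(s, _mm512_loadu_pd(x + 16 * k),     t0);
            t1 = _mm512_fmadd_pd(s, _mm512_loadu_pd(x + 16 * k + 8), t1);
        }
        *acc0 = _mm512_add_pd(*acc0, t0);  /* fold the fresh accumulators in */
        *acc1 = _mm512_add_pd(*acc1, t1);
    }
}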