[llvm-bugs] [Bug 49658] New: [X86][SSE] Bad optimization of auto-vectorized MULH
via llvm-bugs
llvm-bugs at lists.llvm.org
Sat Mar 20 00:56:24 PDT 2021
https://bugs.llvm.org/show_bug.cgi?id=49658
Bug ID: 49658
Summary: [X86][SSE] Bad optimization of auto-vectorized MULH
Product: libraries
Version: trunk
Hardware: Macintosh
OS: MacOS X
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: tellowkrinkle at gmail.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, pengfei.wang at intel.com,
spatel+llvm at rotateright.com
Related: Bug #45897
I was looking at the disassembly of clang's output for libdivide's benchmark code. This
simplified loop reproduces the issues I was seeing:
define dso_local <8 x i32> @test(i32* nocapture readonly %ptr, i32 %mul) local_unnamed_addr #1 {
start:
  %t1 = zext i32 %mul to i64
  %t2 = insertelement <8 x i64> undef, i64 %t1, i32 0
  %mulvec = shufflevector <8 x i64> %t2, <8 x i64> undef, <8 x i32> zeroinitializer
  br label %loop

loop:
  %loopcnt = phi i64 [ 0, %start ], [ %nextcnt, %loop ]
  %sum = phi <8 x i32> [ zeroinitializer, %start ], [ %nextsum, %loop ]
  %ptroff = getelementptr inbounds i32, i32* %ptr, i64 %loopcnt
  %vptroff = bitcast i32* %ptroff to <8 x i32>*
  %v = load <8 x i32>, <8 x i32>* %vptroff, align 4
  %v64 = zext <8 x i32> %v to <8 x i64>
  %vmul = mul nuw <8 x i64> %mulvec, %v64
  %vmulhi = lshr <8 x i64> %vmul, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
  %vtrunc = trunc <8 x i64> %vmulhi to <8 x i32>
  %nextsum = add <8 x i32> %vtrunc, %sum
  %nextcnt = add i64 %loopcnt, 32
  %isdone = icmp eq i64 %nextcnt, 524288
  br i1 %isdone, label %end, label %loop

end:
  ret <8 x i32> %nextsum
}
Godbolt link: https://llvm.godbolt.org/z/n4T3nT
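For context, a scalar loop of roughly the following shape produces the same MULH pattern
when auto-vectorized. This is my reconstruction for illustration only (the function name
sum_mulhi is hypothetical, and the simplified IR above strides by 32 elements per 8-wide
load, which this sketch does not reproduce):

#include <cstddef>
#include <cstdint>

// Accumulate the high 32 bits of an unsigned 32x32->64 multiply.
// The ((uint64_t)ptr[i] * mul) >> 32 idiom is what the IR above
// expresses as zext + mul nuw + lshr 32 + trunc.
uint32_t sum_mulhi(const uint32_t *ptr, size_t n, uint32_t mul) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (uint32_t)(((uint64_t)ptr[i] * mul) >> 32);
    return sum;
}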
Expected loop body:
vmovdqa ymm0, [ptr] ; load data
vpsrlq ymm1, ymm0, 32 ; (or vpshufd) move odd half in place for multiplication
vpmuludq ymm0, ymm0, ymm2 ; (assuming %mulvec is in ymm2) multiply even half
vpmuludq ymm1, ymm1, ymm2 ; multiply odd half
vpsrlq ymm0, ymm0, 32 ; move high words in place for blend
vpblendd ymm0, ymm0, ymm1, 0xaa ; blend high words together
; loopy stuff
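The same sequence can be written with AVX2 intrinsics; a sketch for reference (the helper
name mulhi_epu32 is mine, mulvec is assumed to be the broadcast multiplier, e.g.
_mm256_set1_epi32(mul), and the surrounding loop and accumulation are omitted):

#include <immintrin.h>

// High 32 bits of the unsigned 32x32->64 multiply in every 32-bit lane.
static inline __m256i mulhi_epu32(__m256i v, __m256i mulvec) {
    __m256i odd  = _mm256_srli_epi64(v, 32);       // vpsrlq: odd lanes into even positions
    __m256i even = _mm256_mul_epu32(v, mulvec);    // vpmuludq: even-lane products
    __m256i oddp = _mm256_mul_epu32(odd, mulvec);  // vpmuludq: odd-lane products
    even = _mm256_srli_epi64(even, 32);            // vpsrlq: high words down for the blend
    return _mm256_blend_epi32(even, oddp, 0xaa);   // vpblendd: interleave the high halves
}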
Actual loop body:
vpmovzxdq ymm4, [ptr]
vpmovzxdq ymm5, [ptr+16] ; load data
vpmuludq ymm6, ymm1, ymm5
vpmuludq ymm5, ymm2, ymm5
vpsllq ymm5, ymm5, 32
vpaddq ymm5, ymm6, ymm5 ; full 64-bit multiply of ymm5
vpmuludq ymm6, ymm1, ymm4
vpmuludq ymm4, ymm2, ymm4
vpsllq ymm4, ymm4, 32
vpaddq ymm4, ymm6, ymm4 ; full 64-bit multiply of ymm4
vpermd ymm4, ymm3, ymm4 ; perm high words from ymm4 to bottom
vpermd ymm5, ymm3, ymm5 ; perm high words from ymm5 to bottom
vinserti128 ymm4, ymm4, xmm5, 1 ; combine ymm4 and ymm5
; loopy stuff
Note: Line numbers are valid for commit
bdf39e6b0ed4b41a1842ac0193f30a726f8d9f63
Interestingly, if you remove the loop, the code gets much better (not perfect, but at
least it no longer does full 64-bit multiplies). This appears to be because
combineMulToPMULDQ in X86ISelLowering.cpp:42770 checks whether the upper halves of both
multiply inputs are known to be zero, and emits a single pmuludq in that case. That
check, however, only sees the current basic block, which is why it fails once the loop
is present: the splatted operand is loop-invariant and is therefore defined in a
different basic block.
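To spell out why that check is sufficient: when the upper 32 bits of both 64-bit
operands are known zero, the low-dword product that vpmuludq computes per lane is
already the full 64-bit product, so no general 64-bit multiply is needed. A minimal
standalone check of that identity (illustrative values only):

#include <cassert>
#include <cstdint>

int main() {
    uint64_t a = 0x00000000deadbeefULL;  // upper 32 bits zero, as the zext guarantees
    uint64_t b = 0x0000000012345678ULL;  // upper 32 bits zero, as the splatted zext guarantees
    // vpmuludq semantics per 64-bit lane: multiply the low dwords as 32x32->64.
    uint64_t pmuludq = (uint64_t)(uint32_t)a * (uint64_t)(uint32_t)b;
    assert(pmuludq == a * b);            // identical to the full 64-bit product
    return 0;
}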
There's a second (and probably more appropriate) place this pattern could have been
detected: combineShiftToMULH in DAGCombiner.cpp:8391. So why doesn't it fire here?
1. combineShiftToMULH is looking for (srl (mul (zext i32:$a to i64), (zext
i32:$b to i64)), 32), but we have (srl (mul (zext <8 x i32>:$a to <8 x i64>),
(splat (zext i32:$b to i64) to <8 x i64>)), 32). While that check works fine
for scalar code, it doesn't work nearly as well for vector code, where things
like shuffles can sit between the zext and the mul without actually affecting
whether the conversion is still valid.
2. As before, combineShiftToMULH cannot inspect the values defined for it in
previous basic blocks. This is what it sees:
SelectionDAG has 40 nodes:
t0: ch = EntryToken
t4: i64,ch = CopyFromReg t0, Register:i64 %2
t13: v4i64,ch = CopyFromReg t0, Register:v4i64 %0
t28: i64 = add nsw t4, Constant:i64<128>
t15: v4i64,ch = CopyFromReg t13:1, Register:v4i64 %1
t16: v8i64 = concat_vectors t13, t15
t2: i64,ch = CopyFromReg t0, Register:i64 %6
t5: i64 = add t2, t4
t7: i64 = add t5, Constant:i64<2097152>
t10: v8i32,ch = load<(load 32 from %ir.scevgep, align 4)> t0, t7, undef:i64
t11: v8i64 = zero_extend t10
t17: v8i64 = mul nuw t16, t11
t19: v8i64 = BUILD_VECTOR Constant:i64<32>, Constant:i64<32>, Constant:i64<32>, Constant:i64<32>, Constant:i64<32>, Constant:i64<32>, Constant:i64<32>, Constant:i64<32>
t20: v8i64 = srl t17, t19
t21: v8i32 = truncate t20
t23: v8i32,ch = CopyFromReg t0, Register:v8i32 %3
t24: v8i32 = add t21, t23
t26: ch = CopyToReg t0, Register:v8i32 %4, t24
t30: ch = CopyToReg t0, Register:i64 %5, t28
t35: ch = TokenFactor t26, t30
t32: i1 = setcc t28, Constant:i64<0>, seteq:ch
t34: i1 = xor t32, Constant:i1<-1>
t37: ch = brcond t35, t34, BasicBlock:ch<loop 0x129829748>
t39: ch = br t37, BasicBlock:ch<end 0x1298298f8>
The zext'd value has been split across t13 and t15 and obscured by the concat_vectors
needed to stitch them back together, on top of the bigger issue that both are
`CopyFromReg`s, with little indication of what's on the other side.
Being fairly unfamiliar with LLVM, I don't know what the proper solution is. Is there a
way to look through the `CopyFromReg`s? Should the conversion to MULH nodes be done at
an earlier stage, in a pass that can see the whole function? Or is `CopyFromReg`s
obfuscating things at this stage a more fundamental limitation of LLVM's current
lowering scheme?