[llvm-bugs] [Bug 49658] New: [X86][SSE] Bad optimization of auto-vectorized MULH

Sat Mar 20 00:56:24 PDT 2021

https://bugs.llvm.org/show_bug.cgi?id=49658

            Bug ID: 49658
           Summary: [X86][SSE] Bad optimization of auto-vectorized MULH
           Product: libraries
           Version: trunk
          Hardware: Macintosh
                OS: MacOS X
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: tellowkrinkle at gmail.com
                CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
                    llvm-dev at redking.me.uk, pengfei.wang at intel.com,
                    spatel+llvm at rotateright.com

Related: Bug #45897

Was looking at the clang disassembly from libdivide's benchmark code.  This
simplified loop reproduces the issues I was seeing:

define dso_local <8 x i32> @test(i32* nocapture readonly %ptr, i32 %mul)
local_unnamed_addr #1 {
start:
        %t1 = zext i32 %mul to i64
        %t2 = insertelement <8 x i64> undef, i64 %t1, i32 0
        %mulvec = shufflevector <8 x i64> %t2, <8 x i64> undef, <8 x i32>
zeroinitializer
        br label %loop
loop:
        %loopcnt = phi i64 [ 0, %start ], [ %nextcnt, %loop ]
        %sum = phi <8 x i32> [ zeroinitializer, %start ], [ %nextsum, %loop ]
        %ptroff = getelementptr inbounds i32, i32* %ptr, i64 %loopcnt
        %vptroff = bitcast i32* %ptroff to <8 x i32>*
        %v = load <8 x i32>, <8 x i32>* %vptroff, align 4
        %v64 = zext <8 x i32> %v to <8 x i64>
        %vmul = mul nuw <8 x i64> %mulvec, %v64
        %vmulhi = lshr <8 x i64> %vmul, <i64 32, i64 32, i64 32, i64 32, i64
32, i64 32, i64 32, i64 32>
        %vtrunc = trunc <8 x i64> %vmulhi to <8 x i32>
        %nextsum = add <8 x i32> %vtrunc, %sum
        %nextcnt = add i64 %loopcnt, 32
        %isdone = icmp eq i64 %nextcnt, 524288
        br i1 %isdone, label %end, label %loop
end:
        ret <8 x i32> %nextsum
}

Godbolt link: https://llvm.godbolt.org/z/n4T3nT

Expected loop body:
vmovdqa ymm0, [ptr]    ; load data
vpsrlq ymm1, ymm0, 32  ; (or vpshufd) move odd half in place for multiplication
vpmuludq ymm0, ymm0, ymm2 ; (assuming %mulvec is in ymm2) multiply even half
vpmuludq ymm1, ymm1, ymm2 ; multiply odd half
vpsrlq ymm0, ymm0, 32  ; move high words in place for blend
vpblendd ymm0, ymm0, ymm1, 0xaa ; blend high words together
; loopy stuff

Actual loop body:
vpmovzxdq ymm4, [ptr]
vpmovzxdq ymm5, [ptr+16] ; load data
vpmuludq ymm6, ymm1, ymm5
vpmuludq ymm5, ymm2, ymm5
vpsllq  ymm5, ymm5, 32
vpaddq  ymm5, ymm6, ymm5 ; full 64-bit multiply of ymm5
vpmuludq ymm6, ymm1, ymm4
vpmuludq ymm4, ymm2, ymm4
vpsllq  ymm4, ymm4, 32
vpaddq  ymm4, ymm6, ymm4 ; full 64-bit multiply of ymm4
vpermd  ymm4, ymm3, ymm4 ; perm high words from ymm4 to bottom
vpermd  ymm5, ymm3, ymm5 ; perm high words from ymm5 to bottom
vinserti128 ymm4, ymm4, xmm5, 1 ; combine ymm4 and ymm5
; loopy stuff

Note: Line numbers are valid for commit
bdf39e6b0ed4b41a1842ac0193f30a726f8d9f63

Interestingly, if you remove the loop, the code gets much better (not perfect,
but at least not doing full 64-bit multiplies).  It appears this is because the
function combineMulToPMULDQ in X86ISelLowering.cpp:42770 checks if the upper
half of the multiply inputs are all zero, and outputs a single pmuludq for that
case.  This check, however, only works within the current basic block, which is
why it fails with the loop, since one of the operands is constant throughout
the loop and is therefore created in a different basic block.

There's a second (and probably more appropriate) place this could have been
detected, however: combineShiftToMULH in DAGCombiner.cpp:8391.  So why didn't
it?
1. combineShiftToMULH is looking for (srl (mul (zext i32:$a to i64), (zext
i32:$a to i64)), 32) but we have (srl (mul (zext <8 x i32>:$a to <8 x i64>),
(splat (zext i32:$a to i64) to <8 x i64>)), 32).  While the check works fine
for scalar code, it doesn't work nearly as well for vector code where things
like shuffles can be in between the zext and mul without actually affecting
whether the conversion is okay to carry out
2. Like before, combineShiftToMULH is finding itself unable to inspect the
values left for it by previous basic blocks.  This is what it sees:

SelectionDAG has 40 nodes:
  t0: ch = EntryToken
  t4: i64,ch = CopyFromReg t0, Register:i64 %2
  t13: v4i64,ch = CopyFromReg t0, Register:v4i64 %0
  t28: i64 = add nsw t4, Constant:i64<128>
                    t15: v4i64,ch = CopyFromReg t13:1, Register:v4i64 %1
                  t16: v8i64 = concat_vectors t13, t15
                          t2: i64,ch = CopyFromReg t0, Register:i64 %6
                        t5: i64 = add t2, t4
                      t7: i64 = add t5, Constant:i64<2097152>
                    t10: v8i32,ch = load<(load 32 from %ir.scevgep, align 4)>
t0, t7, undef:i64
                  t11: v8i64 = zero_extend t10
                t17: v8i64 = mul nuw t16, t11
                t19: v8i64 = BUILD_VECTOR Constant:i64<32>, Constant:i64<32>,
Constant:i64<32>, Constant:i64<32>, Constant:i64<32>, Constant:i64<32>,
Constant:i64<32>, Constant:i64<32>
              t20: v8i64 = srl t17, t19
            t21: v8i32 = truncate t20
            t23: v8i32,ch = CopyFromReg t0, Register:v8i32 %3
          t24: v8i32 = add t21, t23
        t26: ch = CopyToReg t0, Register:v8i32 %4, t24
        t30: ch = CopyToReg t0, Register:i64 %5, t28
      t35: ch = TokenFactor t26, t30
        t32: i1 = setcc t28, Constant:i64<0>, seteq:ch
      t34: i1 = xor t32, Constant:i1<-1>
    t37: ch = brcond t35, t34, BasicBlock:ch<loop 0x129829748>
  t39: ch = br t37, BasicBlock:ch<end 0x1298298f8>

The zext'd value has been split across t13 and t15, obfuscated by needing to be
concatenated together, in addition to the bigger issue of them being
`CopyFromReg`s with little indication of what's on the other side

Being fairly unfamiliar with LLVM, I don't know what the proper solution here
is.  Is there a way to look through the `CopyFromReg`s?  Should the conversion
to MULHs be done at an earlier stage, in a pass that has access to the whole
function?  Or is the issue of `CopyFromReg`s obfuscating things at this stage a
bigger issue with LLVM's current lowering system?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20210320/1838191f/attachment.html>