[llvm-branch-commits] [llvm] be4016e - [X86] Fix logic for optimizing movmsk(bitcast(shuffle(x))); PR67287
Tobias Hieta via llvm-branch-commits
llvm-branch-commits at lists.llvm.org
Mon Oct 16 01:18:06 PDT 2023
Author: Noah Goldstein
Date: 2023-10-16T10:15:06+02:00
New Revision: be4016e52779e07f0a433a0a1b07ddb1c8938428
URL: https://github.com/llvm/llvm-project/commit/be4016e52779e07f0a433a0a1b07ddb1c8938428
DIFF: https://github.com/llvm/llvm-project/commit/be4016e52779e07f0a433a0a1b07ddb1c8938428.diff
LOG: [X86] Fix logic for optimizing movmsk(bitcast(shuffle(x))); PR67287
Prior logic would remove the shuffle iff all of the elements in `x`
were used. This is incorrect.
The issue is that `movmsk` only cares about the highbits, so if the
elements in `x` are narrower than the elements of the `movmsk`, then
the shuffle, even if it preserves all the elements, may change which
of them supply those highbits.
For example:
`movmsk64(bitcast(shuffle32(x, (1,0,3,2))))`
Even though the shuffle mask `(1,0,3,2)` preserves all the elements, it
flips which ones are relevant to the `movmsk64` (x[1] and x[3] before,
x[0] and x[2] after).
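To make the before/after concrete, here is a minimal standalone C++
model of `movmsk64` over four 32-bit lanes (an illustration only, not
LLVM code; the names `movmsk64`, `x`, and `s` are made up for the
example):

#include <cstdint>
#include <cstdio>

// Model: movmsk64 packs bit 63 of each 64-bit element, which is bit 31
// of the odd 32-bit lanes (x[1] and x[3]).
static unsigned movmsk64(const uint32_t x[4]) {
  return ((x[1] >> 31) & 1) | (((x[3] >> 31) & 1) << 1);
}

int main() {
  uint32_t x[4] = {0x80000000u, 0u, 0x80000000u, 0u}; // sign bits in x[0], x[2]
  uint32_t s[4] = {x[1], x[0], x[3], x[2]};           // shuffle32(x, (1,0,3,2))
  // With the shuffle, movmsk64 sees the sign bits of x[0] and x[2]
  // (result 3); without it, it sees x[1] and x[3] (result 0), so
  // removing the shuffle changes the answer.
  printf("with shuffle: %u, without: %u\n", movmsk64(s), movmsk64(x));
  return 0;
}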
The fix here is to ensure that the shuffle mask can be scaled to the
element width of the `movmsk` instruction. This ensures that the
"high" elements stay "high". It is overly conservative, as it misses
cases like `(1,1,3,3)` where the "high" elements stay intact despite
the mask not being scalable, but for a relatively edge-case
optimization that should generally be handled during
simplifyDemandedBits, this seems okay.
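As a sketch of what the scalability check buys us, here is a
hypothetical, heavily simplified stand-in for the widening direction of
scaleShuffleElements (not the LLVM implementation; it ignores undef and
zero sentinels in the mask):

#include <cassert>
#include <cstdio>
#include <vector>

// A narrow mask can be widened to NumDstElts wide elements iff each
// group of Scale consecutive narrow lanes reads one whole wide element
// in order, i.e. group g is (w*Scale, w*Scale+1, ...) for some w.
static bool canWidenMask(const std::vector<int> &Mask, unsigned NumDstElts) {
  assert(Mask.size() % NumDstElts == 0 && "mask must divide evenly");
  unsigned Scale = (unsigned)(Mask.size() / NumDstElts);
  for (unsigned G = 0; G != NumDstElts; ++G) {
    int Lead = Mask[G * Scale];
    if (Lead % (int)Scale != 0)
      return false; // group does not start on a wide-element boundary
    for (unsigned I = 1; I != Scale; ++I)
      if (Mask[G * Scale + I] != Lead + (int)I)
        return false; // group is not one contiguous, in-order wide element
  }
  return true;
}

int main() {
  printf("%d\n", canWidenMask({1, 0, 3, 2}, 2)); // 0: swaps inside 64-bit lanes
  printf("%d\n", canWidenMask({2, 3, 0, 1}, 2)); // 1: widens to (1,0), fold safe
  return 0;
}

So the shuffle from PR67287 is rejected, while a shuffle that only moves
whole 64-bit elements would still allow the fold.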
(cherry picked from commit 1684c65bc997a8ce0ecf96a493784fe39def75de)
Added:
Modified:
llvm/lib/Target/X86/X86ISelLowering.cpp
llvm/test/CodeGen/X86/movmsk-cmp.ll
Removed:
################################################################################
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 426e3143ac9b212..0f1cb5f1e236656 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -48539,13 +48539,28 @@ static SDValue combineSetCCMOVMSK(SDValue EFLAGS, X86::CondCode &CC,
}
// MOVMSK(SHUFFLE(X,u)) -> MOVMSK(X) iff every element is referenced.
- SmallVector<int, 32> ShuffleMask;
+ // Since we peek through a bitcast, we need to be careful if the base vector
+ // type has smaller elements than the MOVMSK type. In that case, even if
+ // all the elements are demanded by the shuffle mask, only the "high"
+ // elements whose highbits align with the highbits of the MOVMSK vector
+ // elements are actually demanded. Spurious operations on the "low"
+ // elements are cleaned up during other simplifications.
+ //
+ // For example:
+ // MOVMSK64(BITCAST(SHUF32 X, (1,0,3,2))): even though all the elements are
+ // demanded, the result can change because the shuffle swaps which elements
+ // supply the highbits.
+ //
+ // To address this, we check that we can scale the shuffle mask to the
+ // MOVMSK element width (this ensures the "high" elements match). It's
+ // slightly overly conservative, but fine for an edge-case fold.
+ SmallVector<int, 32> ShuffleMask, ScaledMaskUnused;
SmallVector<SDValue, 2> ShuffleInputs;
if (NumElts <= CmpBits &&
getTargetShuffleInputs(peekThroughBitcasts(Vec), ShuffleInputs,
ShuffleMask, DAG) &&
ShuffleInputs.size() == 1 && !isAnyZeroOrUndef(ShuffleMask) &&
- ShuffleInputs[0].getValueSizeInBits() == VecVT.getSizeInBits()) {
+ ShuffleInputs[0].getValueSizeInBits() == VecVT.getSizeInBits() &&
+ scaleShuffleElements(ShuffleMask, NumElts, ScaledMaskUnused)) {
unsigned NumShuffleElts = ShuffleMask.size();
APInt DemandedElts = APInt::getZero(NumShuffleElts);
for (int M : ShuffleMask) {
diff --git a/llvm/test/CodeGen/X86/movmsk-cmp.ll b/llvm/test/CodeGen/X86/movmsk-cmp.ll
index f7ba49ce0e12732..278a6a8b128eb50 100644
--- a/llvm/test/CodeGen/X86/movmsk-cmp.ll
+++ b/llvm/test/CodeGen/X86/movmsk-cmp.ll
@@ -4458,13 +4458,14 @@ define i32 @PR39665_c_ray_opt(<2 x double> %x, <2 x double> %y) {
define i32 @pr67287(<2 x i64> %broadcast.splatinsert25) {
; SSE2-LABEL: pr67287:
; SSE2: # %bb.0: # %entry
-; SSE2-NEXT: movl $3, %eax
-; SSE2-NEXT: testl %eax, %eax
-; SSE2-NEXT: jne .LBB97_2
-; SSE2-NEXT: # %bb.1: # %entry
; SSE2-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
; SSE2-NEXT: pxor %xmm1, %xmm1
; SSE2-NEXT: pcmpeqd %xmm0, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,0,3,2]
+; SSE2-NEXT: movmskpd %xmm0, %eax
+; SSE2-NEXT: testl %eax, %eax
+; SSE2-NEXT: jne .LBB97_2
+; SSE2-NEXT: # %bb.1: # %entry
; SSE2-NEXT: movd %xmm1, %eax
; SSE2-NEXT: testb $1, %al
; SSE2-NEXT: jne .LBB97_2