[llvm] [SLP] Avoid crash in computeExtractCost (PR #93188)

Thu May 23 06:54:11 PDT 2024

bjope wrote:

> > For a downstream target we ended up in a situation with assertion failures in ShuffleCostEstimator::computeExtractCost.
> > Input IR looked like this:
> > define void @foo(ptr %p0, <64 x i32> %vec) { %p1 = getelementptr i32, ptr %p0, i16 1 %p2 = getelementptr i32, ptr %p0, i16 2 %p3 = getelementptr i32, ptr %p0, i16 3 %p4 = getelementptr i32, ptr %p0, i16 4 %p5 = getelementptr i32, ptr %p0, i16 5 %p6 = getelementptr i32, ptr %p0, i16 6 %p7 = getelementptr i32, ptr %p0, i16 7 %elt = extractelement <64 x i32> %vec, i32 0 store i32 %elt, ptr %p0 store i32 %elt, ptr %p1 store i32 %elt, ptr %p2 store i32 %elt, ptr %p3 store i32 %elt, ptr %p4 store i32 %elt, ptr %p5 store i32 %elt, ptr %p6 store i32 %elt, ptr %p7 ret void }
> > And the scenario was like this:
> > 
> > * VL and Mask has 8 elements at entry to computeExtractCost.
> > * NumParts is 2 (v8i32 is not legal, but v4i32 is legal).
> > * NumElts is calculated as 64 (given by extractelement <64 x i32>).
> > * NumSrcRegs is calculated and is set to 1 (v64i32 is legal).
> > * EltsPerVector is calculated as 64 (given by NumElts/NumSrcRegs).
> > * Assertion failure happens when doing ArrayRef MaskSlice = Mask.slice(Part * EltsPerVector, (Part == NumParts - 1 && Mask.size() % EltsPerVector != 0) ? Mask.size() % EltsPerVector : EltsPerVector); since EltsPerVector is larger than Mask.size() already for Part==0.
> > 
> > This patch resolved the issue by making sure that we slice up Mask in at most EltsPerVector pieces until we have covered the full Mask. When we have covered all elements in Mask we break the loop.
> > Haven't been able to reproduce this scenario for any in-tree target. So unfortunately there is no regression test included in the patch.
> 
> It is just an unexpected situation, where <8 x i32> has 2 parts, while <64 x i32> has just one. Is this correct at all? Actually, all this code must be moved to TTI and calculated there for best cost estimation, this is just a temporary solution.

I actually find the existing TTI.getNumberOfParts helper a bit weird (at least the default BasicTTIImple version). It is implemented as checking for some type legalization cost. Type legalization might involve scalarization/widening/splitting etc. So the number returned is some sort of cost (how many steps involved in legalizing the type?). But SLPVectorizer is for example treating it as NumSrcRegs.
Our target has lots of general purpose registers, and vectors could be mapped to a sequence of such registers. So v4i32 is for example 4 consecutive general purpose registers. Still v4i32 is a legal type, so getNumberOfParts("v4i32") is one.
The target also has a totally different register pool with vector registers. In that pool there are vector registers, for example making v32i32 legal. So getNumberOfParts("v32i32") is also one. Similarly v64i32 is legal, but it is actually two consecutive vector registers, but no split is needed at legalization so getNumberOfParts("v64i32") is also one.

https://github.com/llvm/llvm-project/pull/93188