[llvm] [AMDGPU] Vectorize i8 Shuffles (PR #95840)

Fri Aug 23 08:10:40 PDT 2024

jrbyrnes wrote:

> code in the vectorizer where this is making a difference? 

Sure -- for this work, there are two main pieces: 1. allowing SLP to consider i8s for vectorization, and 2. preferring vector shuffles for i8.

1.

This is accomplished by changes to `getNumberOfParts` and `getMaximumVF`.

The most fundamental use of `TTI.getNumberOfParts` in SLP is in `tryToVectorizeList` ( https://github.com/llvm/llvm-project/blob/3c54aa14aa5f92ea2c96b85efbc7945ed55451e4/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L16884 ). As an early exit, it compares the vectorization factor (VF, bounded above by `TTI.getMaximumVF`) to `TTI.getNumberOfParts` for the candidate vector type. If the two values are equal, the vector type legalizes to one part per element, so any vectorization would be undone by scalarization during type legalization.
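To make the early exit concrete, here is a small standalone sketch of the comparison. It is not the SLP code: `numberOfParts`, `worthTryingVF`, and the `LegalVectorBits` parameter are hypothetical stand-ins for what a target reports through `TTI.getNumberOfParts`, and the legalization model is deliberately simplified.

```cpp
#include <cassert>

// Hypothetical model of type legalization: how many legal parts a vector of
// NumElts elements, each EltBits wide, splits into. LegalVectorBits is the
// widest vector the (imaginary) target supports for this element type; a value
// too small to hold two elements means every element becomes its own part.
unsigned numberOfParts(unsigned NumElts, unsigned EltBits,
                       unsigned LegalVectorBits) {
  assert(NumElts > 0 && EltBits > 0 && "expected a non-empty vector type");
  if (LegalVectorBits < 2 * EltBits)
    return NumElts; // fully scalarized: one part per element
  unsigned TotalBits = NumElts * EltBits;
  return (TotalBits + LegalVectorBits - 1) / LegalVectorBits; // round up
}

// Mirrors the shape of the early exit: if the number of parts equals VF, the
// vector would be split back into scalars, so this VF is not worth trying.
bool worthTryingVF(unsigned VF, unsigned EltBits, unsigned LegalVectorBits) {
  return numberOfParts(VF, EltBits, LegalVectorBits) != VF;
}
```

In this model a 4 x i8 candidate is rejected when the target reports one part per element, and accepted once the target reports that four i8s fit in a single 32-bit part.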

2.

This is accomplished by changes to `getShuffleCost`.

The pieces for 1. only consider the type and not the operation, which is obviously important (e.g. v2i16 may be legal for a given target, but the target may not have a vector UDiv for i16). Operation considerations are encoded in the cost model, which SLP consults when calculating the cost of vectorizing ( https://github.com/llvm/llvm-project/blob/3c54aa14aa5f92ea2c96b85efbc7945ed55451e4/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L9401 called from https://github.com/llvm/llvm-project/blob/3c54aa14aa5f92ea2c96b85efbc7945ed55451e4/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L10626 ) by comparing the total operation cost of the vectorized code against that of the scalar code. If vectorization does not offer enough cost improvement, then we don't vectorize. The opcode switch in `getEntryCost` is my usual starting point for reading SLP's use of the cost model.

To be honest, `getShuffleCost` is one of the most commonly used cost queries in SLP, so it's a bit hard to give you an exact pointer for this PR. That said, some rather important invocations are: checking the shuffle cost when comparing the cost of scalarized vs vectorized insertelements ( https://github.com/llvm/llvm-project/blob/3c54aa14aa5f92ea2c96b85efbc7945ed55451e4/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L9721 ), and adding cost when we need to shuffle the vector between operations in a vector chain ( https://github.com/llvm/llvm-project/blob/3c54aa14aa5f92ea2c96b85efbc7945ed55451e4/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L9459 ) or for external users ( https://github.com/llvm/llvm-project/blob/3c54aa14aa5f92ea2c96b85efbc7945ed55451e4/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L10928 ).
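As a rough illustration of why the shuffle cost matters for the final decision, the sketch below models the vector-vs-scalar comparison with made-up numbers. `EntryCost`, `treeCostDelta`, and `shouldVectorize` are hypothetical names, not SLP APIs, and the costs are placeholders rather than anything a real target returns; the only part taken from SLP is the general shape of the decision (sum vector cost minus scalar cost, then require the result to fall below the negated threshold).

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-node costs: what the operation costs as scalars vs as one
// vector instruction (in the abstract units a cost model would return).
struct EntryCost {
  int Scalar;
  int Vector;
};

// Sum (vector - scalar) over all nodes, plus whatever extra shuffles are
// needed between operations or for external users. Negative means the
// vectorized form is estimated to be cheaper.
int treeCostDelta(const std::vector<EntryCost> &Entries, int ExtraShuffleCost) {
  assert(ExtraShuffleCost >= 0 && "shuffle cost modeled as non-negative");
  int Delta = ExtraShuffleCost;
  for (const EntryCost &E : Entries)
    Delta += E.Vector - E.Scalar;
  return Delta;
}

// Vectorize only if the estimated saving exceeds the threshold.
bool shouldVectorize(int Delta, int Threshold = 0) {
  return Delta < -Threshold;
}
```

With two nodes that each cost 4 as scalars and 1 as a vector, a shuffle cost of 2 gives a delta of -4 and the tree is vectorized, while a shuffle cost of 8 gives +2 and it is not. That is the lever the `getShuffleCost` change pulls on.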

https://github.com/llvm/llvm-project/pull/95840