[PATCH] D111460: [X86][LoopVectorize] Fix `LoopVectorizationCostModel::getMemInstScalarizationCost()`

Roman Lebedev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Oct 9 13:05:15 PDT 2021


lebedev.ri added a comment.

In D111460#3053404 <https://reviews.llvm.org/D111460#3053404>, @dmgreen wrote:


@dmgreen thank you for taking a look!

> This looks like the same patch as D93129 <https://reviews.llvm.org/D93129>.

Ah, indeed.

> I don't believe that patch was correct and I don't really believe this is, but the code is super unintuitive and really not well explained anywhere. Let me try and describe it, as far as I understand things.
>
> This just changes the type passed to getAddressComputationCost. That function is called from:
>
> - LoopVectorizationCostModel::getMemoryInstructionCost with a scalar type and no SCEV for the cost of a scalar load/store.
> - LoopVectorizationCostModel::getGatherScatterCost with a vector type and no SCEV for the cost of a gather/scatter
> - LoopVectorizationCostModel::getUniformMemOpCost with a scalar type and no SCEV for a splatted load/store.
> - LoopVectorizationCostModel::getMemInstScalarizationCost in this case, with a vector type and a SCEV. Multiplied by VF.
>
> So "vector type plus SCEV" means scalarized load/stores. The actual implementation of getAddressComputationCost then is:
>
>   InstructionCost X86TTIImpl::getAddressComputationCost(Type *Ty,
>                                                         ScalarEvolution *SE,
>                                                         const SCEV *Ptr) {
>     // Address computations in vectorized code with non-consecutive addresses will
>     // likely result in more instructions compared to scalar code where the
>     // computation can more often be merged into the index mode. The resulting
>     // extra micro-ops can significantly decrease throughput.
>     const unsigned NumVectorInstToHideOverhead = 10;
>   
>     // Cost modeling of Strided Access Computation is hidden by the indexing
>     // modes of X86 regardless of the stride value. We don't believe that there
>     // is a difference between constant strided access in general and a
>     // constant stride value which is less than or equal to 64.
>     // Even in the case of (loop invariant) stride whose value is not known at
>     // compile time, the address computation will not incur more than one extra
>     // ADD instruction.
>     if (Ty->isVectorTy() && SE) {
>       if (!BaseT::isStridedAccess(Ptr))
>         return NumVectorInstToHideOverhead;
>       if (!BaseT::getConstantStrideStep(SE, Ptr))
>         return 1;
>     }
>   
>     return BaseT::getAddressComputationCost(Ty, SE, Ptr);
>   }
>
> So the NumVectorInstToHideOverhead is designed specifically for this case - to increase the cost for scalarized loads/stores that are unlikely to be worth vectorizing due to the "significantly decreased throughput". If you want to change that - changing how X86TTIImpl::getAddressComputationCost works is likely your best bet. Otherwise you are just turning that into dead code. The base getAddressComputationCost never returns anything other than 0 and doesn't distinguish between scalar and vector types.
>
> Like I say, this isn't explained very well, but as far as I can see it is "working as intended". The vectorizer asks the backend what the overhead of scalarized loads/stores is, and the backend has answered. Even if it is a strange way to ask.

I see. While it really does look that way, I maintain that this is not intentional, just a bug/leftover from the D27919 <https://reviews.llvm.org/D27919> rework.
To know for sure we'd have to ask the patch's author (@delena), but there has been no recent activity on that account, so I don't really expect to get an answer.
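
For onlookers, the call site in question (the last bullet above) looks roughly like this. This is a paraphrase of the current tree, simplified, with helper names as I read them, so treat it as a sketch rather than an exact quote:

  // Sketch of the relevant part of
  // LoopVectorizationCostModel::getMemInstScalarizationCost().
  Type *PtrTy = ToVectorTy(Ptr->getType(), VF);
  // Figure out whether the access is strided, and get the stride if it is
  // known at compile time.
  const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, PSE, TheLoop);
  // One address computation per scalarized lane.
  InstructionCost Cost =
      VF.getKnownMinValue() *
      TTI.getAddressComputationCost(PtrTy, PSE.getSE(), PtrSCEV);

So this is indeed the only caller that passes both a vector type and a SCEV, and the result is scaled by VF: with e.g. VF = 8 and a non-strided pointer, the address computations alone are costed at 8 * 10 = 80.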

> I've not seen a lot of evidence that disabling the equivalent code for ARM/AArch64 is a generally good idea, even if there are individual cases where it is an improvement.



> There is no SCEV for vectors, and so no LSR for vector addresses, and pulling address values out of vectors can be very slow. Perhaps that can be improved over time.

Sure. But I'm not really following: where do we pull address values out of vectors?
This approach (as opposed to D111220 <https://reviews.llvm.org/D111220>) specifically does not result in any vector GEPs.
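
To make that concrete, consider a hypothetical strided kernel (my example, not from the patch):

  // A stride-3 load. Scalarizing it emits VF independent scalar GEPs and
  // scalar loads, with each address computed directly in a scalar register.
  // The slow pattern mentioned above would only arise if a vector GEP
  // (<VF x ptr>) were formed first and the lane addresses then extracted.
  void strided(float *dst, const float *src, int n) {
    for (int i = 0; i < n; ++i)
      dst[i] = src[3 * i];
  }

Since this codepath never forms the vector GEP, each lane's address is plain scalar arithmetic, and there is nothing to pull out of a vector in the first place.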


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111460/new/

https://reviews.llvm.org/D111460


