[PATCH] D128302: [AArch64][CostModel] Detects that ExtractElement at index is not free in AArch64 when result is used as integer.

Thu Aug 25 00:27:09 PDT 2022

mingmingl added a comment.

In D128302#3743563 <https://reviews.llvm.org/D128302#3743563>, @davidxl wrote:

> I guess fhahn@ meant that for all useful scenarios, the instruction needs to have some uses (otherwise they are dead).  For this case, the check should probably be tightened : instead of checking if the instruction has uses, it needs to examine the uses and differentiate integer uses that requires a fmov and other use contexts (which makes the extract noop).

Added a FIXME to estimate extract/insert cost with use-def contexts.

> Checking use emptiness can pessimize cases where the cost is zero, but it returns non-zero cost.

In D128302#3731045 <https://reviews.llvm.org/D128302#3731045>, @mingmingl wrote:

> In D128302#3725318 <https://reviews.llvm.org/D128302#3725318>, @fhahn wrote:
>
>> Just a general comment: please try to include the full context when uploading a diff: https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface
>
> Yeah I will remember to do this from now on (already did it earlier this week upon suggestions from another reviewer, and a partial diff was purely unintentional.) Also will hold on updating the patch to keep the patch context for discussion (leave all comments open).

Uploaded with full context this time.

I'd like to point out the current patch still lacks test coverage for "virtual" vector instructions in `print<cost-model> pass, due to the fact that instructions are physical in an IR. Will upload to get preliminary feedback on the overall direction -> it's feasible to get back test coverage, by calling op-based (not instruction based) `getVectorInstrCost` in `print<cost-model>` pass (the callsite of op-based cost API could be gated by a boolean option and turned on where it's needed).

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1993
+    //
+    // The cost of extracting a scalar element from a vector register depends
+    // on how scalar will be used:
----------------
RKSimon wrote:
> mingmingl wrote:
> > fhahn wrote:
> > > I might be missing something here, but the comment here doesn't seem to match the code, which just checks if the result is used. I'd expect that in all interesting cases the result will be used. And if it is not used the cost is not really that interesting, as it is effectively a no-op that should be removed.
> > It makes complete sense that instructions without users are not interesting, and [1] is sufficient for this patch (will hold on updating diff for better context). Let me elaborate why this patch and it'd be great to hear your thoughts!  //(This is a lot of words despite efforts to simplify them.. hopefully structure / bold fonts makes it easier. Thanks for reviewing this patch and reading the replies!) //
> > 
> > About the motivating use case
> > 
> > - It is shown in `LICM/AArch64/extract-element.ll` -> an extract-element instruction shouldn't be considered free (and thereby [[ https://github.com/llvm/llvm-project/blob/a4230319f7af786f4d947f39a870f339b32a6fe7/llvm/lib/Transforms/Scalar/LICM.cpp#L575-L580 | sinked to loop exit ]]) by LICM when cost kind is `TCK_SizeAndLatency` -> extract-element has size and latency cost, unless it could be "combined" into its user instructions (e.g., user instruction can access lane directly). 
> > - Say the loop-exit is frequently executed in a hot path, sink `extract-element` (along with a series of its def instructions) is inefficient. 
> > - Redundancy elimination might be an alternative option but I haven't got a chance to look into the alternative option yet, with the thought that fixing cost model is overall a good change as well :)*
> > 
> > About the motivation of this patch
> > * When the instruction exists in IR (indicated by function parameter `const Instruction& I`) and extract result is integer (a stronger signal of a `fmov` being emitted), this patch use the same cost all lanes (not considering lane 0 free). 
> > 
> > So why this patch wants to tell a real vector instruction from a virtual one (by asking for `const Instruction&I`, not just `Opcode`)?
> > * Vectorizer passes have `Opcode` but not real instructions in its provisioning stage [2]. Not telling real from virtual may make vectorizer result better or worse in a fragile way. 
> > * Plus, a new interface of `getVectorInstCost` won't mess up the codebase (w.r.t API/code style) when improvements for vectorizer cost models are added in the future.
> > 
> > But why vectorizer gives better or worse result in a fragile way?
> > -  Take `BaseTTIImpl::getScalarizationOverhead` as an example, it calls `getVectorInstrCost` [[ https://github.com/llvm/llvm-project/blob/515ece1a9005b29747e5e19bccb17f2ae3d78fb9/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L722-L727 | element-by-element ]] for virtual instructions. **//With various SIMD instructions, element-by-element estimation is not accurate for AArch64//**.
> > - AArch64TTI doesn't override this method so `BaseTTIImpl` is used. Just as a reference, x86 backend overrides this method for some specialized cases.
> > - If this patch is //NOT// no-op for method `getScalarizationOverhead`, it could exacerbates the inaccurate cost model (SLP, etc) and misleads vectorizers.
> > - **In practice, when non-zero cost for lane 0 is applied for "provisioned" vector instructions, some IR from `llvm/test/Transforms/<vectorizer-subdir>/AArch64/` gets better generated code and some gets worse generated code**. I didn't inspect them all, but presumably there are going to be better or worse cases, since without this change vectorizers sometimes gives worse output (e.g., https://godbolt.org/z/c1bjhGdT5 is for llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll)
> > 
> > A related question, I wonder if it's with a reason (or just unintended status quo) that `getVectorInstrCost` interface doesn't ask for a `CostKind` parameter? 
> > 
> > [1]
> > 
> > ```
> > +InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
> > +                                                   Type *Val, unsigned Index) {
> > +  InstructionCost cost = getVectorInstrCost(I.getOpcode(), Val, Index);
> > +
> > +  // According to NEON programmer guide, other than multiply instructions,
> > +  // instructions that access scalars can access any element in the vector register.
> > +  //
> > +  // The cost of an extract depends on whether the user instruction.
> > +  // 1. If users could use scalars in vector registers directly (e.g., store), the
> > +  // extract-element operation is essentially free.
> > +  // 2. If the user instruction requires core register as operand (i.e.,
> > +  // cannot use scalars in vector register), an explicit move operation will
> > +  // be codegen'd.
> > +  //
> > +  // 'cost' might be an optimistic 0 when lane is 0.
> > +  // Returns the base cost (as the same with other lanes) if the scalar is an integer.
> > +  //
> > +  // FIXME:
> > +  // The cost of extract-element from vector is better estimated with context (e.g. users).
> > +  if (!isa_and_nonnull<ExtractElementInst>(&I) || !Val->getScalarType()->isIntegerTy()) 
> > +    return ST->getVectorInsertExtractBaseCost();
> > +  return cost;
> > +}
> > +
> > ```
> > [2]  Basically, vectorizer pass sees one form (scalar or vector) of "physical" instructions (that do exists in the IR), and tries to "provision" the cost of the other form; this means instructions of the other form doesn't exist yet, and only Opcode is known and used to provision the cost.
> > A related question, I wonder if it's with a reason (or just unintended status quo) that getVectorInstrCost interface doesn't ask for a CostKind parameter?
> 
> This is on my TODO list that was waiting for the getInstructionCost refactor (D79483) - but I want to get some other CostKind changes (notably D132216) out of the way first.
Thanks for making the cost-model changes! (mark this as done).

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1993
+    //
+    // The cost of extracting a scalar element from a vector register depends
+    // on how scalar will be used:
----------------
mingmingl wrote:
> RKSimon wrote:
> > mingmingl wrote:
> > > fhahn wrote:
> > > > I might be missing something here, but the comment here doesn't seem to match the code, which just checks if the result is used. I'd expect that in all interesting cases the result will be used. And if it is not used the cost is not really that interesting, as it is effectively a no-op that should be removed.
> > > It makes complete sense that instructions without users are not interesting, and [1] is sufficient for this patch (will hold on updating diff for better context). Let me elaborate why this patch and it'd be great to hear your thoughts!  //(This is a lot of words despite efforts to simplify them.. hopefully structure / bold fonts makes it easier. Thanks for reviewing this patch and reading the replies!) //
> > > 
> > > About the motivating use case
> > > 
> > > - It is shown in `LICM/AArch64/extract-element.ll` -> an extract-element instruction shouldn't be considered free (and thereby [[ https://github.com/llvm/llvm-project/blob/a4230319f7af786f4d947f39a870f339b32a6fe7/llvm/lib/Transforms/Scalar/LICM.cpp#L575-L580 | sinked to loop exit ]]) by LICM when cost kind is `TCK_SizeAndLatency` -> extract-element has size and latency cost, unless it could be "combined" into its user instructions (e.g., user instruction can access lane directly). 
> > > - Say the loop-exit is frequently executed in a hot path, sink `extract-element` (along with a series of its def instructions) is inefficient. 
> > > - Redundancy elimination might be an alternative option but I haven't got a chance to look into the alternative option yet, with the thought that fixing cost model is overall a good change as well :)*
> > > 
> > > About the motivation of this patch
> > > * When the instruction exists in IR (indicated by function parameter `const Instruction& I`) and extract result is integer (a stronger signal of a `fmov` being emitted), this patch use the same cost all lanes (not considering lane 0 free). 
> > > 
> > > So why this patch wants to tell a real vector instruction from a virtual one (by asking for `const Instruction&I`, not just `Opcode`)?
> > > * Vectorizer passes have `Opcode` but not real instructions in its provisioning stage [2]. Not telling real from virtual may make vectorizer result better or worse in a fragile way. 
> > > * Plus, a new interface of `getVectorInstCost` won't mess up the codebase (w.r.t API/code style) when improvements for vectorizer cost models are added in the future.
> > > 
> > > But why vectorizer gives better or worse result in a fragile way?
> > > -  Take `BaseTTIImpl::getScalarizationOverhead` as an example, it calls `getVectorInstrCost` [[ https://github.com/llvm/llvm-project/blob/515ece1a9005b29747e5e19bccb17f2ae3d78fb9/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L722-L727 | element-by-element ]] for virtual instructions. **//With various SIMD instructions, element-by-element estimation is not accurate for AArch64//**.
> > > - AArch64TTI doesn't override this method so `BaseTTIImpl` is used. Just as a reference, x86 backend overrides this method for some specialized cases.
> > > - If this patch is //NOT// no-op for method `getScalarizationOverhead`, it could exacerbates the inaccurate cost model (SLP, etc) and misleads vectorizers.
> > > - **In practice, when non-zero cost for lane 0 is applied for "provisioned" vector instructions, some IR from `llvm/test/Transforms/<vectorizer-subdir>/AArch64/` gets better generated code and some gets worse generated code**. I didn't inspect them all, but presumably there are going to be better or worse cases, since without this change vectorizers sometimes gives worse output (e.g., https://godbolt.org/z/c1bjhGdT5 is for llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll)
> > > 
> > > A related question, I wonder if it's with a reason (or just unintended status quo) that `getVectorInstrCost` interface doesn't ask for a `CostKind` parameter? 
> > > 
> > > [1]
> > > 
> > > ```
> > > +InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
> > > +                                                   Type *Val, unsigned Index) {
> > > +  InstructionCost cost = getVectorInstrCost(I.getOpcode(), Val, Index);
> > > +
> > > +  // According to NEON programmer guide, other than multiply instructions,
> > > +  // instructions that access scalars can access any element in the vector register.
> > > +  //
> > > +  // The cost of an extract depends on whether the user instruction.
> > > +  // 1. If users could use scalars in vector registers directly (e.g., store), the
> > > +  // extract-element operation is essentially free.
> > > +  // 2. If the user instruction requires core register as operand (i.e.,
> > > +  // cannot use scalars in vector register), an explicit move operation will
> > > +  // be codegen'd.
> > > +  //
> > > +  // 'cost' might be an optimistic 0 when lane is 0.
> > > +  // Returns the base cost (as the same with other lanes) if the scalar is an integer.
> > > +  //
> > > +  // FIXME:
> > > +  // The cost of extract-element from vector is better estimated with context (e.g. users).
> > > +  if (!isa_and_nonnull<ExtractElementInst>(&I) || !Val->getScalarType()->isIntegerTy()) 
> > > +    return ST->getVectorInsertExtractBaseCost();
> > > +  return cost;
> > > +}
> > > +
> > > ```
> > > [2]  Basically, vectorizer pass sees one form (scalar or vector) of "physical" instructions (that do exists in the IR), and tries to "provision" the cost of the other form; this means instructions of the other form doesn't exist yet, and only Opcode is known and used to provision the cost.
> > > A related question, I wonder if it's with a reason (or just unintended status quo) that getVectorInstrCost interface doesn't ask for a CostKind parameter?
> > 
> > This is on my TODO list that was waiting for the getInstructionCost refactor (D79483) - but I want to get some other CostKind changes (notably D132216) out of the way first.
> Thanks for making the cost-model changes! (mark this as done).
> 
> 
Resolved by a refactoring that includes code sharing, comment update and dropping the uninteresting case.

> Redundancy elimination might be an alternative option but I haven't got a chance to look into the alternative option yet, with the thought that fixing cost model is overall a good change as well :)*

Just to provide an update on this, looked into how to optimize by redundancy elimination this week -> machine-cse does eliminate the redundant machine instructions (and `LICM/AArch64/extract-element.ll` is supposed to be a reduced test case).

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1981
+                                                   Type *Val, unsigned Index) {
+  InstructionCost cost = this->getVectorInstrCost(I.getOpcode(), Val, Index);
+
----------------
mingmingl wrote:
> RKSimon wrote:
> > fhahn wrote:
> > > nit: LLVM coding style uses `UpperCase` for variables.
> > do you need the this-> ?
> Ah `this->` is not needed, fixed it.
> 
> (Sorry for the delay; I was out in the past week)
With the refactoring this variable is gone. Updated patch should use UpperCase variables already. 

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128302/new/

https://reviews.llvm.org/D128302