[PATCH] D109445: [SVE][LoopVectorize] Optimise code generated by widenPHIInstruction

Fri Sep 10 01:48:30 PDT 2021

sdesmalen added a comment.

Thanks @RosieSumpter that explains it well I think.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4771
+          // instruction that is generated for scalar uses of this vector (e.g.
+          // a load instruction) is not folded. Therefore we still need to
+          // calculate values for the first n lanes to avoid a redundant move
----------------
nit: s/is not folded/is not folded away/

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4771-4772
+          // instruction that is generated for scalar uses of this vector (e.g.
+          // a load instruction) is not folded. Therefore we still need to
+          // calculate values for the first n lanes to avoid a redundant move
+          // (when extracting the 0th element) and to produce scalar code (i.e.
----------------
sdesmalen wrote:
> nit: s/is not folded/is not folded away/
nit: s/we still need to calculate/we still calculate/
nit: s/to avoid a redundant move/to avoid redundant moves/

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4769-4771
+          // extraction of any lane. However, to generate better code, we still
+          // need to calculate values for the first n lanes since these could be
+          // required later (e.g. by a load instruction).
----------------
RosieSumpter wrote:
> SjoerdMeijer wrote:
> > sdesmalen wrote:
> > > SjoerdMeijer wrote:
> > > > sdesmalen wrote:
> > > > > Hi @RosieSumpter, I think it's worth elaborating a little bit more on the 'generate better code' in the comment.
> > > > > 
> > > > > [(too) long explanation here]
> > > > > 
> > > > > From what I understand, the code is better because the `extractelement` instruction that is otherwise generated (for scalar uses of this vector) may not always be folded away if the stepvector has multiple uses, leading to a redundant move (in case of element 0 for the vector-element-0 -> gpr move) or possibly expensive extractelement instructions (to extract a fixed-width lane from a scalable vector) for element > 0.
> > > > > 
> > > > > In the former case, the value for element 0 is freely available because it is the start value of the stepvector.
> > > > > In the latter case, there will be a cost regardless. Either the additional `add/gep` generated below to offset from the start value of the stepvector, or the extract from the stepvector itself. It's just expected that the scalar code will be cheaper.
> > > > > 
> > > > > Can you maybe capture some of that in the comment? (albeit more succinctly)
> > > > I had exactly the same questions as Sander. The main question I think is indeed why this is better, which it's not (that) obvious from the test changes. Thus, I was wondering, does this deserve adding some CodeGen tests?
> > > IMO an improved description should be sufficient.
> > Ok, but to be more explicit: this shows that the IR -> asm step isn't tested, is it? Why would we not test this? I think it would help too with explaining why this is better.
> For now I have just updated the comment (hopefully it makes sense - I have tried to add some detail but keep it concise, but am happy to change it!) 
> 
> Also happy to add a codegen test if it's decided that it's necessary.
> Ok, but to be more explicit: this shows that the IR -> asm step isn't tested, is it? Why would we not test this?
Correct, it normally doesn't happen for individual IR passes that the resulting asm for a particular target is tested for a change to the transformation, right? IMO that should be avoided whenever possible, because it defeats the purpose of a unit test.

In this case it should be sufficient to test explicitly that the `extractelement` does not exist after loop-vectorize. That happens with the extra `CHECK-NOT` line  in sve-widen-phi.ll. The fmov is the code-generated equivalent to the extract element of element 0, so would there be any value in also code-generating this for AArch64 to show that the fmov is removed?

Perhaps @RosieSumpter can add a similar comment to clarify the `CHECK-NOT` line to clarify this a bit more in the test itself?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109445/new/

https://reviews.llvm.org/D109445