[PATCH] D30680: new method TargetTransformInfo::supportsVectorElementLoadStore() for LoopVectorizer

Thu Apr 6 10:26:17 PDT 2017

anemet added a comment.

In https://reviews.llvm.org/D30680#718817, @jonpa wrote:

> In https://reviews.llvm.org/D30680#718786, @anemet wrote:
>
> > In https://reviews.llvm.org/D30680#713835, @jonpa wrote:
> >
> > > In https://reviews.llvm.org/D30680#713268, @anemet wrote:
> > >
> > > > Sorry about the delay on this but I was working on something related for ARM that may benefit from this as well.  What I need for ARM is something that can communicate to the SLPVectorizer that load-pair and store-pair (of two registers) is efficiently supported on the target.  I am wondering if we can combine the two things if your new hook would take the type and the vectorization width.
> > > >
> > > > What do you think?
> > >
> > >
> > > Is this also in the context of scalarizing a load / store?
> > >
> > > For SystemZ, a scalarized memory access will have to do VF memory operations, but there is no need to extract or insert any of the data elements, as there are vector element load/store instructions.
> >
> >
> > We have something like this on ARM too.  ld1 can load any element of a vector (e.g. ld1.s {v1}[1], [x1] loads lane 1 of vector reg v1) and st1 can store any element.  That said, ld1 is still a partial write of the vector register so in terms of performance, it's worse than a regular load which is a full write.  I think that modeling its cost as a load + insert (for non-zero-lane) is fairly accurate.  Doesn't this match the situation on SystemZ?
>
>
> As far as I know, there is no extra penalty on SystemZ for using a vector element load, so scalarizing a vector load will really cost e.g. 4 loads at VF 4. This should be better than doing 4 scalar loads and 4 inserts.

My point is not whether it's better or not (it certainly is shorter) but whether 4 scalar loads have the same cost as four vector-element loads.  The hook would state the latter.  Anyhow, for in-order processors I could see how this could be true.
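
To make the two cost models concrete, here is a minimal sketch of the comparison, not the actual LoopVectorizer code.  The struct, its fields and the function name are hypothetical; only the idea of a boolean "supports vector element load/store" query comes from the patch.  It bakes in exactly the assumption I am questioning, namely that a vector-element load/store costs the same as a scalar one.

```cpp
#include <cassert>

// Hypothetical per-target unit costs for illustration only; these are
// not LLVM's real cost tables.
struct TargetCosts {
  unsigned ScalarMemOp;   // cost of one scalar load or store
  unsigned InsertExtract; // cost of one insertelement / extractelement
  // Mirrors what the proposed hook would report for the target.
  bool SupportsVectorElementLoadStore;
};

// Cost of scalarizing a single vector memory access at vectorization
// factor VF.  ASSUMPTION: when the target has vector-element load/store
// instructions, each one costs the same as a scalar memory operation and
// no insert/extract is needed.
unsigned scalarizedMemOpCost(const TargetCosts &T, unsigned VF) {
  assert(VF > 0 && "vectorization factor must be positive");
  unsigned Cost = VF * T.ScalarMemOp;
  if (!T.SupportsVectorElementLoadStore)
    Cost += VF * T.InsertExtract; // move each element in or out of the vector
  return Cost;
}
```

With unit costs of 1, VF 4 gives 4 when the hook returns true and 8 (4 loads + 4 inserts) otherwise.  A target where the element load is a partial register write would want a per-element cost above ScalarMemOp rather than a plain boolean.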

> Are you saying that this only makes sense for stores on ARM? In that case maybe a boolean argument like IsStore might work?

Yes, I think so.

> What about the handling of two registers at a time you mentioned earlier?

I convinced myself that this is a separate issue.  There, we want to communicate that loading or storing a pair of registers (<= 64 bits) takes only one instruction in scalar mode.

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:437

+  bool supportsVectorElementLoadStore() const;
+
----------------
Needs comment.

https://reviews.llvm.org/D30680