[PATCH] D53865: [LoopVectorizer] Improve computation of scalarization overhead.

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 18 02:21:40 PST 2018


jonpa updated this revision to Diff 178627.
jonpa added a comment.

Thanks for the review and suggestions! I tried your ideas and was pleasantly surprised that explicitly scalarizing the instructions, instead of just correcting their costs as I first did, worked quite "out of the box", as you said it would. :-)

I still think this patch is experimental in the sense that I haven't seen any convincing benchmark improvements on my target (SystemZ) to motivate this extra code in the LoopVectorizer, although it seems to do what it's intended to do. I now see two main parts in this patch:

1. Find instructions (sequences) that will be scalarized by the target and explicitly scalarize them with correct costs, just like is done for predicated instructions.
2. Scalarize loads and stores that are connected to a scalarized user/def instead of always widening them, when that is cheaper.

- For (1), I implemented isScalarWithoutPredication(), per your suggestion to compare the "scalar * VF" cost against the "vector" cost. Unfortunately, I had to add an "InPassedVF" argument to getInstructionCost() and getMemoryInstructionCost(), so as not to confuse those methods, which may pass the Instruction pointer to the target implementation. This extra argument is set to false when querying VF=1 costs on behalf of a vectorized loop, so that the target does not take this for a scalar instruction in a scalar context and find folding opportunities that may not be present in the vectorized loop. I then took inspiration from computePredInstDiscount() and implemented computeInstDiscount() to find the scalarized sequences. It does not attempt to search past vectorizable instructions, but it does recognize scalarized loads / stores. (A rough sketch of this cost comparison follows after the notes below.)

At this point I noticed that on SystemZ (z13) this primarily affected ~100 loops containing the float (fp32) type, which is scalarized. Typically these were of the form "load; op(s); store", with some variations. This patch made those vector loops cheaper (with more correct scalarization costs), while the scalar version still suffered a bit from not considering loads folded into the scalar operations such as fadd. The operand scalarization cost was also improved by checking for loop-invariant operands, which is not currently done. (Side notes: VF=2 now gets a good treatment since the loop is scalarized, and the DAG problem of expansion to VF=4 disappeared. Also, I had to disable this for integer div/rem, since we still see spilling if those loops are vectorized.)
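To illustrate the idea in (1), here is a minimal, self-contained C++ sketch of the "scalar * VF" versus "vector" cost comparison. It is not code from the patch: the types, helpers and numbers below (InstCosts, preferScalarized(), the example costs) are hypothetical stand-ins used only to show the shape of the check that isScalarWithoutPredication() is described as making.

  // Hedged sketch, not the actual patch: a standalone model of the cost
  // comparison described above. An instruction is scalarized without
  // predication when VF scalar copies plus the insert/extract overhead
  // are cheaper than the widened vector instruction.
  #include <cstdint>

  struct InstCosts {
    uint64_t ScalarCost;        // cost of one scalar copy of the instruction
    uint64_t VectorCost;        // cost of the widened (vector) instruction
    uint64_t InsertExtractCost; // per-element cost to move values between
                                // scalar and vector form (skipped for
                                // loop-invariant operands, as noted above)
    unsigned NumNonInvariantOperands; // operands that actually need extracts
  };

  // Returns true when scalarizing the instruction for this VF is expected
  // to be cheaper than widening it -- the check the text calls
  // isScalarWithoutPredication(), modelled here in isolation.
  static bool preferScalarized(const InstCosts &C, unsigned VF) {
    // Cost of VF scalar copies...
    uint64_t Scalarized = C.ScalarCost * VF;
    // ...plus extracting each non-loop-invariant operand element and
    // inserting the results back into a vector for any vectorized users.
    Scalarized += C.InsertExtractCost * VF * (C.NumNonInvariantOperands + 1);
    return Scalarized < C.VectorCost;
  }

  int main() {
    // Example: an fp32 operation that the target scalarizes anyway, so the
    // "vector" cost already includes an expensive expansion.
    InstCosts FAdd = {/*Scalar*/ 1, /*Vector*/ 20, /*InsExt*/ 1, /*Ops*/ 2};
    return preferScalarized(FAdd, /*VF=*/4) ? 0 : 1;
  }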

- (2) needs (1) to work, since the correct cost is only attained if e.g. the scalarized instruction using the load does not also get an (incorrect) extraction cost. Basically, if the cost of the vector instruction plus the extraction/insertion costs is greater than the cost of the scalar instructions, the memory access is scalarized instead of widened.

Unfortunately, this could not be done with computeInstDiscount() as a simple extension of (1), since memory accesses are dealt with before anything else; it had to be done in setCostBasedWideningDecision(). It was also not possible to use the new isScalarWithoutPredication() at this point, since getInstructionCost() cannot be called before the scalars have been collected. I therefore used the TLI check to see which instructions would be scalarized, wrapped in a target hook I named preferScalarizedMemFor(). Perhaps there is a better way to do this, but for SystemZ I wanted to avoid extractions into GPRs (integer extractions), and also prefer to store 2x64 bits instead of 128 bits if the wide store requires inserts. (A rough sketch of this widening decision follows below.)

This seemed beneficial for loads in places, but I saw that LSR generated poor code for the 2x64 stores, with duplicated address computations for the address registers instead of using an immediate offset.
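And a similarly hedged sketch of the decision in (2). Again, none of this is from the patch: MemAccessCosts, decide() and targetPrefersScalarizedMem() are made-up stand-ins (the latter only gesturing at what a hook like preferScalarizedMemFor() might answer), intended to show how a load/store connected to a scalarized user/def can be scalarized when widening plus the element traffic is more expensive than VF scalar accesses.

  // Hedged sketch with hypothetical types; models the widening decision
  // described in (2) in isolation from the real cost model.
  #include <cstdint>

  enum class WideningDecision { Widen, Scalarize };

  struct MemAccessCosts {
    uint64_t ScalarAccessCost;  // one scalar load/store
    uint64_t WideAccessCost;    // the widened vector load/store
    uint64_t InsertExtractCost; // per-element insert (loads feeding scalarized
                                // users) or extract (stores of scalarized defs)
  };

  // Stand-in for a target hook in the spirit of preferScalarizedMemFor():
  // here it simply vetoes widening for element types the target would
  // scalarize anyway (an assumption for illustration only).
  static bool targetPrefersScalarizedMem(bool ElementTypeIsScalarized) {
    return ElementTypeIsScalarized;
  }

  static WideningDecision decide(const MemAccessCosts &C, unsigned VF,
                                 bool ConnectedToScalarizedInst,
                                 bool ElementTypeIsScalarized) {
    if (!ConnectedToScalarizedInst)
      return WideningDecision::Widen;
    if (targetPrefersScalarizedMem(ElementTypeIsScalarized))
      return WideningDecision::Scalarize;
    // A widened access must also pay for moving each element to/from the
    // scalarized user/def; compare that against VF scalar accesses.
    uint64_t WidenTotal = C.WideAccessCost + C.InsertExtractCost * VF;
    uint64_t ScalarTotal = C.ScalarAccessCost * VF;
    return WidenTotal > ScalarTotal ? WideningDecision::Scalarize
                                    : WideningDecision::Widen;
  }

  int main() {
    // Example: a load feeding a scalarized fp32 sequence, where element
    // extraction is relatively expensive.
    MemAccessCosts Load = {/*Scalar*/ 1, /*Wide*/ 1, /*InsExt*/ 4};
    return decide(Load, /*VF=*/4, /*Connected=*/true,
                  /*ElemScalarized=*/false) == WideningDecision::Scalarize
               ? 0
               : 1;
  }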


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D53865/new/

https://reviews.llvm.org/D53865

Files:
  include/llvm/Analysis/TargetTransformInfo.h
  include/llvm/Analysis/TargetTransformInfoImpl.h
  lib/Analysis/TargetTransformInfo.cpp
  lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
  lib/Target/SystemZ/SystemZTargetTransformInfo.h
  lib/Transforms/Vectorize/LoopVectorize.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D53865.178627.patch
Type: text/x-patch
Size: 17819 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20181218/d290e7c5/attachment.bin>

