[PATCH] D46130: [NVPTX] Turn on Loop/SLP vectorization

Fri Apr 27 06:50:22 PDT 2018

hfinkel added inline comments.

================
Comment at: lib/Target/NVPTX/NVPTXTargetTransformInfo.h:53
+  // NVPTX has infinite registers of all kinds.
+  unsigned getNumberOfRegisters(bool Vector) const { return 1; }
+
----------------
bkramer wrote:
> hfinkel wrote:
> > jlebar wrote:
> > > bkramer wrote:
> > > > jlebar wrote:
> > > > > bkramer wrote:
> > > > > > jlebar wrote:
> > > > > > > Does 1 have specific meaning?  I don't see this in any of the comments, and that would be a pretty weird API...  (Like, did you mean -1?)
> > > > > > 1 is the default of the generic implementation, I just copied that and removed the check for vector. I'm not even sure if anyone ever checks the value or just compares it against zero.
> > > > > I see a few places using the actual value returned: LoopStrengthReduce, LoopVectorize...
> > > > Right, LoopStrengthReduce calls it with (false), which was returning '1' before and after :)
> > > > 
> > > > I'm not sure about LoopVectorize, but conservatively limiting its vectorization factor is probably a good thing.
> > > OK, I think I see what you're after: You want to make the minimal/safest change that gets vectorization on your testcases?
> > > 
> > > That sounds good to me, but can we make the comment on "return 1" express this motivation?
> > LoopVectorize is using this number to control the amount of interleaving/unrolling it does. The idea being that inverleaving is beneficial until you create too much register pressure (i.e., until you start spilling).
> > 
> > I don't think that setting this to 1 is a good idea. Perfectly reasonable changes to the loop vectorizer in the future could cause this to completely disable vectorizer. You should set this, I suspect, to the maximum number of registers you can have per thread at full occupancy.
> > 
> This is intentional. I want to keep LoopVectorizer as conservative as possible for now. I have benchmarks where SLP makes a clear improvement, for LV it's much less clear and it has the potential of causing huge regressions, especially since NVPTX TTI is not yet tuned for it.
> 
> Happy to reword the comment more in case that's still unclear and put in a note to revisit making the LV more aggressive. WDYT?
Please give the TTI functions meaningful values. It's perfectly plausible that this can be used for something else in the future, and you're just asking for trouble when we do need to change this later because the delta is bound to be large.

That having been said, we certainly do want the LV to be conservative in terms of increasing register pressure. Luckily, we have a separate TTI function for that:

    unsigned getMaxInterleaveFactor(unsigned VF) const;

make this function return 1 for all values of VF.  I believe this is the default, but in this case, I recommend explicitly overriding it with a comment about explicitly wanting to be conservative about reducing register pressure. In terms of the LV, this should have the same current effect.

As I recall, the register file size on a Volta, for example, is 256kB/SM and you can have 2048 threads/SM, so that leaves 32 32-bit registers per thread at full occupancy. Thus, I'd recommend setting this number to 32.

Repository:
  rL LLVM

https://reviews.llvm.org/D46130