[llvm] r174660 - Constrain PowerPC autovectorization to fix bug 15041.

Hal Finkel hfinkel at anl.gov
Thu Feb 7 23:04:56 PST 2013


----- Original Message -----
> From: "Bill Schmidt" <wschmidt at linux.vnet.ibm.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-commits at cs.uiuc.edu
> Sent: Thursday, February 7, 2013 8:32:14 PM
> Subject: Re: [llvm] r174660 - Constrain PowerPC autovectorization to fix bug 15041.
> 
> On Thu, 2013-02-07 at 19:38 -0600, Hal Finkel wrote:
> > 
> > ----- Original Message -----
> > > From: "Bill Schmidt" <wschmidt at linux.vnet.ibm.com>
> > > To: "Hal Finkel" <hfinkel at anl.gov>
> > > Cc: llvm-commits at cs.uiuc.edu
> > > Sent: Thursday, February 7, 2013 4:52:20 PM
> > > Subject: Re: [llvm] r174660 - Constrain PowerPC autovectorization
> > > to fix bug 15041.
> > > 
> > > On Thu, 2013-02-07 at 15:23 -0600, Hal Finkel wrote:
> > > > ----- Original Message -----
> > > > > From: "Bill Schmidt" <wschmidt at linux.vnet.ibm.com>
> > > > > To: llvm-commits at cs.uiuc.edu
> > > > > Sent: Thursday, February 7, 2013 2:33:57 PM
> > > > > Subject: [llvm] r174660 - Constrain PowerPC autovectorization to
> > > > > fix bug 15041.
> > > > > 
> > > > > Author: wschmidt
> > > > > Date: Thu Feb  7 14:33:57 2013
> > > > > New Revision: 174660
> > > > > 
> > > > > URL: http://llvm.org/viewvc/llvm-project?rev=174660&view=rev
> > > > > Log:
> > > > > Constrain PowerPC autovectorization to fix bug 15041.
> > > > > 
> > > > > Certain vector operations don't vectorize well with the current
> > > > > PowerPC implementation.  Element insert/extract performs poorly
> > > > > without VSX support because Altivec requires going through memory.
> > > > > SREM, UREM, and VSELECT all produce bad scalar code.
> > > > > 
> > > > > There's a lot of work to do for the cost model before
> > > > > autovectorization will be tuned well, and this is not an attempt
> > > > > to address the larger problem.
> > > > > 
> > > > > Modified:
> > > > >     llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
> > > > > 
> > > > > Modified: llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
> > > > > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp?rev=174660&r1=174659&r2=174660&view=diff
> > > > > ==============================================================================
> > > > > --- llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp (original)
> > > > > +++ llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp Thu Feb  7 14:33:57 2013
> > > > > @@ -194,6 +194,25 @@ unsigned PPCTTI::getVectorInstrCost(unsi
> > > > >                                      unsigned Index) const {
> > > > >    assert(Val->isVectorTy() && "This must be a vector type");
> > > > >  
> > > > > +  const unsigned Awful = 1000;
> > > > 
> > > > Does it need to be this high? I would think that:
> > > > 
> > > > For extract: Cost == 1 (vector store) + 1 (scalar load)
> > > > For insert: Cost == 1 (vector store) + 1 (scalar store) + 1 (vector load)
> > > > For srem/urem: Cost == 1 (vector store) + N (scalar loads) + N*O (operation costs) + N (scalar stores) + 1 (vector load)
> > > > For vselect: Cost == 1 (vector store) + N (scalar loads) + N*O (selects) + N (scalar stores) + 1 (vector load)
> > > > 
> > > > would be pretty accurate. Is that not enough? Do we need
> > > > additional costs on the loads to account for splitting the
> > > > operations among different dispatch groups?
> > > 
> > > You don't understand the cost of the "load-hit-store" scenario on
> > > most Power processors.  A store followed relatively quickly by a
> > > load from the same doubleword suffers a large penalty because of
> > > the way the store queues are designed.  We really don't want to
> > > vectorize when an operation is going to go through memory in that
> > > fashion.  At a minimum, the kinds of costs you describe above need
> > > to be inflated by a value between 5 and 10 because of processor
> > > stall cycles.
> > 
> > Fair enough. We should probably make a comment to this effect in
> > the cost model definition.
> 
> Agreed.
> 
> > 
> > > 
> > > The word "Awful" was chosen for a reason -- the code being
> > > generated by vectorization in paq8p was absolutely horrific. ;)  I
> > > ran into the same thing for urem in another test when I was playing
> > > around with getting vector-select working (which was less
> > > straightforward than I'd hoped).
> > > 
> > > In all honesty, if a real cost model isn't going to be built, it
> > > would seem better to just shut down auto-vectorization for PowerPC.
> > > I'm sure it does well for very simple cases, but the things I've
> > > seen were amazingly bad (a 15x increase in code size and very poor
> > > performance on the simple code in SingleSource/.../factor.c).
> > 
> > Is the default cost model really that bad except for the
> > scalarization (and insert/extract) costs?
> 
> I don't really know what else is wrong at this point.  In some of
> these cases it looked like somebody opened a file and vomited
> instructions into it. ;)  These were the things that stood out
> immediately as Not a Good Idea.

Yeah, the scalarization will do that; it is very ugly.

> 
> The basic concerns are:
>  - Insert/extract on Altivec requires going through memory; and
>  - Any scalarization that requires use of integer instructions also
> requires going through memory, because there is no direct path between
> the fixed-point and floating-point units.
> 
> For the latter, the issues I've noticed are VSELECT and SREM/UREM.
> This is likely not an exhaustive list.
> 
> In both of these cases, the cost of going through memory is not only
> the additional instructions, but also the cost of the reject for the
> load that follows the store.

Understood, although I'd like to be optimistic and think that there is a reasonably-easy way to model this.
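For what it's worth, here is a standalone sketch of what I mean (plain C++, not actual LLVM TTI code; the penalty constant and the per-element breakdown are assumptions based on Bill's 5-10x stall estimate, not measured values):

```cpp
#include <cassert>

// Hypothetical constant: extra cost charged to any load that follows a
// recent store to the same doubleword, per Bill's 5-10x description.
const unsigned LoadHitStorePenalty = 8;

// Cost of extracting one element via memory on Altivec:
// vector store + scalar load, where the load pays the store-queue reject.
unsigned extractElementCost() {
  return 1 /* vector store */ + 1 /* scalar load */ + LoadHitStorePenalty;
}

// Cost of scalarizing an integer op (e.g. urem) over a vector of
// NumElts elements: vector store, per-element reload + op + store,
// then a vector reload that also hits the store queue.
unsigned scalarizedOpCost(unsigned NumElts, unsigned ScalarOpCost) {
  unsigned Cost = 1;                           // vector store
  Cost += NumElts * (1 + LoadHitStorePenalty); // scalar loads (each rejected)
  Cost += NumElts * ScalarOpCost;              // scalar operations
  Cost += NumElts;                             // scalar stores
  Cost += 1 + LoadHitStorePenalty;             // vector reload (rejected)
  return Cost;
}
```

With these illustrative numbers, a scalarized 4-element urem at 7 instructions per element already costs many times the naive instruction count, which may be enough to steer the vectorizer away without resorting to 1000.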

> 
> I agree that just returning a huge number isn't the best long-term
> solution, but I wanted to at least get something out there to speed up
> the test runs.  That will help with efforts to debug our LNT buildbot
> problems.

Understood.

> 
> Even with these basic constraints in place, I'm seeing some benchmarks
> running ~60x slower than GCC, and my guess is this reflects additional
> vectorization cases.

Yuck.

> 
> > 
> > > 
> > > I'm open to trying to do a little better for these particular
> > > cases, but I personally don't have the time to spend to put
> > > together a real cost model for all instructions.  I'm just trying
> > > to patch up the spurting artery I'm seeing for these few ugly
> > > scenarios.
> > 
> > Okay.
> 
> That sounded worse than I meant it to.  What I mean is that my current
> focus is still on ensuring all essential features are present in the
> back end and ABI-compatible.  Fixing performance issues is something I
> don't have a lot of time for until that's done.  This was egregious
> enough that tests were failing by timing out, so I felt I should at
> least try to get a band-aid on it, but I would like to see someone
> with more interest in autovectorization working on the long-term
> solutions.

Understood.

> 
> > 
> > > 
> > > I have commitments this evening, but will look at Nadem's
> > > suggestions tomorrow.
> 
> Ack, sorry Nadav, didn't mean to conflate your first and last names. ;)
> 
> > 
> > I think what Nadav was talking about is this:
> > 
> > unsigned BasicTTI::getScalarizationOverhead(Type *Ty, bool Insert,
> >                                             bool Extract) const {
> >   assert (Ty->isVectorTy() && "Can only scalarize vectors");
> >   unsigned Cost = 0;
> > 
> >   for (int i = 0, e = Ty->getVectorNumElements(); i < e; ++i) {
> >     if (Insert)
> >       Cost += TopTTI->getVectorInstrCost(Instruction::InsertElement, Ty, i);
> >     if (Extract)
> >       Cost += TopTTI->getVectorInstrCost(Instruction::ExtractElement, Ty, i);
> >   }
> > 
> >   return Cost;
> > }
> > 
> > We might need to make this overridable by the target as well? I'm
> > not sure that adding the costs in this way will, by itself,
> > capture the load-after-store hazards that you described.
> 
> Hm.  I was adjusting the cost of ISD::INSERT_VECTOR_ELT vs.
> Instruction::InsertElement, etc. -- apparently I'm not doing it right?

Yes. Also, when I added the initial PPCTTI file I included some CostModel tests to validate the cost model changes. Please do the same for these and we'll be sure to get them right.
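As a reference point, a minimal test along those lines might look like the following (the RUN-line options and function are illustrative; the CHECK value of 1000 matches the Awful constant from the patch):

```llvm
; RUN: opt < %s -cost-model -analyze -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr7 | FileCheck %s

define i32 @extract(<4 x i32> %v) {
  ; CHECK: cost of 1000 {{.*}} extractelement
  %r = extractelement <4 x i32> %v, i32 2
  ret i32 %r
}
```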

> In any case, setting those individual costs high should be adequate
> (perhaps a cost of 10 per element or so).

Sounds good.

> 
> How does this fit in with scalarizing a UREM, for example?  Each
> element has the insert and extract cost along with about 7 scalar
> instructions to calculate the remainder (assuming the setup code for
> the magic number can be factored out of the loop).  How do I ensure
> that cost is accounted for?

I think that the scalarization cost is added to the N*(instruction) cost, so you should not have to do anything special.
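Concretely, the arithmetic I have in mind looks like this (a standalone sketch rather than the actual BasicTTI code; the per-element cost of 10 and the 7-instruction urem estimate are just the numbers from this thread):

```cpp
#include <cassert>

// Illustrative per-element costs; real values would come from the
// target's getVectorInstrCost.
unsigned insertEltCost(unsigned)  { return 10; }
unsigned extractEltCost(unsigned) { return 10; }

// Scalarization overhead for a two-operand op: extract both operands of
// each element and insert each result back, in the spirit of
// BasicTTI::getScalarizationOverhead.
unsigned scalarizationOverhead(unsigned NumElts) {
  unsigned Cost = 0;
  for (unsigned i = 0; i != NumElts; ++i)
    Cost += insertEltCost(i) + 2 * extractEltCost(i); // 1 result, 2 operands
  return Cost;
}

// Total cost of a scalarized vector urem: the overhead above plus
// NumElts copies of the scalar instruction sequence (about 7
// instructions each, per the magic-number estimate).
unsigned scalarizedURemCost(unsigned NumElts, unsigned ScalarURemCost = 7) {
  return scalarizationOverhead(NumElts) + NumElts * ScalarURemCost;
}
```

So raising the per-element insert/extract costs should automatically raise the total for every scalarized operation, without the per-opcode special cases.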

Thanks again,
Hal

> 
> Thanks,
> Bill
> 
> > 
> > Thanks again,
> > Hal
> > 
> > > 
> > > Thanks,
> > > Bill
> > > 
> > > > 
> > > > Thanks again,
> > > > Hal
> > > > 
> > > > > +
> > > > > +  // Vector element insert/extract with Altivec is very expensive.
> > > > > +  // Until VSX is available, avoid vectorizing loops that require
> > > > > +  // these operations.
> > > > > +  if (Opcode == ISD::EXTRACT_VECTOR_ELT ||
> > > > > +      Opcode == ISD::INSERT_VECTOR_ELT)
> > > > > +    return Awful;
> > > > > +
> > > > > +  // We don't vectorize SREM/UREM so well.  Constrain the vectorizer
> > > > > +  // for those as well.
> > > > > +  if (Opcode == ISD::SREM || Opcode == ISD::UREM)
> > > > > +    return Awful;
> > > > > +
> > > > > +  // VSELECT is not yet implemented, leading to use of insert/extract
> > > > > +  // and ISEL, hence not a good idea.
> > > > > +  if (Opcode == ISD::VSELECT)
> > > > > +    return Awful;
> > > > > +
> > > > >    return TargetTransformInfo::getVectorInstrCost(Opcode, Val, Index);
> > > > >  }
> > > > >  
> > > > > 
> > > > > 
> > > > > _______________________________________________
> > > > > llvm-commits mailing list
> > > > > llvm-commits at cs.uiuc.edu
> > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits


