[llvm] r174660 - Constrain PowerPC autovectorization to fix bug 15041.

Bill Schmidt wschmidt at linux.vnet.ibm.com
Thu Feb 7 18:32:14 PST 2013


On Thu, 2013-02-07 at 19:38 -0600, Hal Finkel wrote:
> 
> ----- Original Message -----
> > From: "Bill Schmidt" <wschmidt at linux.vnet.ibm.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvm-commits at cs.uiuc.edu
> > Sent: Thursday, February 7, 2013 4:52:20 PM
> > Subject: Re: [llvm] r174660 - Constrain PowerPC autovectorization to fix bug 15041.
> > 
> > On Thu, 2013-02-07 at 15:23 -0600, Hal Finkel wrote:
> > > ----- Original Message -----
> > > > From: "Bill Schmidt" <wschmidt at linux.vnet.ibm.com>
> > > > To: llvm-commits at cs.uiuc.edu
> > > > Sent: Thursday, February 7, 2013 2:33:57 PM
> > > > Subject: [llvm] r174660 - Constrain PowerPC autovectorization to
> > > > fix bug 15041.
> > > > 
> > > > Author: wschmidt
> > > > Date: Thu Feb  7 14:33:57 2013
> > > > New Revision: 174660
> > > > 
> > > > URL: http://llvm.org/viewvc/llvm-project?rev=174660&view=rev
> > > > Log:
> > > > Constrain PowerPC autovectorization to fix bug 15041.
> > > > 
> > > > Certain vector operations don't vectorize well with the current
> > > > PowerPC implementation.  Element insert/extract performs poorly
> > > > without VSX support because Altivec requires going through
> > > > memory.
> > > > SREM, UREM, and VSELECT all produce bad scalar code.
> > > > 
> > > > There's a lot of work to do for the cost model before
> > > > autovectorization will be tuned well, and this is not an attempt
> > > > to
> > > > address the larger problem.
> > > > 
> > > > Modified:
> > > >     llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
> > > > 
> > > > Modified:
> > > > llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
> > > > URL:
> > > > http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp?rev=174660&r1=174659&r2=174660&view=diff
> > > > ==============================================================================
> > > > --- llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
> > > > (original)
> > > > +++ llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp Thu
> > > > Feb
> > > >  7 14:33:57 2013
> > > > @@ -194,6 +194,25 @@ unsigned PPCTTI::getVectorInstrCost(unsi
> > > >                                      unsigned Index) const {
> > > >    assert(Val->isVectorTy() && "This must be a vector type");
> > > >  
> > > > +  const unsigned Awful = 1000;
> > > 
> > > Does it need to be this high? I would think that:
> > > 
> > > For extract: Cost == 1 (vector store) + 1 (scalar load)
> > > For insert: Cost == 1 (vector store) + 1 (scalar store) + 1 (vector
> > > load)
> > > For srem/urem: Cost == 1 (vector store) + N (scalar loads) + N*O
> > > (operation costs) + N (scalar stores) + 1 (vector load)
> > > For vselect: Cost == 1 (vector store) + N (scalar loads) + N*O
> > > (selects) + N (scalar stores) + 1 (vector load)
> > > 
> > > would be pretty accurate. Is that not enough? Do we need additional
> > > costs on the loads to account for splitting the operations among
> > > different dispatch groups?
> > 
> > You don't understand the cost of the "load-hit-store" scenario on
> > most
> > Power processors.  A store followed relatively quickly by a load from
> > the same doubleword suffers a large penalty because of the way the
> > store
> > queues are designed.  We really don't want to vectorize when an
> > operation is going to go through memory in that fashion.  At a
> > minimum,
> > the kinds of costs you describe above need to be inflated by a value
> > between 5 and 10 because of processor stall cycles.
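[Editor's sketch: Hal's naive instruction tallies above, inflated by the 5-10x stall factor Bill describes. The `Stall` value and the resulting numbers are assumptions for illustration, not values from the LLVM tree.]

```cpp
#include <cassert>

// Assumed load-hit-store penalty factor, in the 5-10x range Bill suggests.
constexpr unsigned Stall = 8;

// Extract: 1 vector store + 1 scalar load, with the reload paying the stall.
constexpr unsigned ExtractCost = (1 + 1) * Stall;

// Insert: 1 vector store + 1 scalar store + 1 vector load.
constexpr unsigned InsertCost = (1 + 1 + 1) * Stall;
```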
> 
> Fair enough. We should probably make a comment to this effect in the cost model definition.

Agreed.

> 
> > 
> > The word "Awful" was chosen for a reason -- the code being generated
> > by
> > vectorization in paq8p was absolutely horrific. ;)  I ran into the
> > same
> > thing for urem in another test when I was playing around with getting
> > vector-select working (which was less straightforward than I'd
> > hoped).
> > 
> > In all honesty, if a real cost model isn't going to be built, it
> > would
> > seem better to just shut down auto-vectorization for PowerPC.  I'm
> > sure
> > it does well for very simple cases, but the things I've seen were
> > amazingly bad (a 15x increase in code size and very poor performance
> > on
> > the simple code in SingleSource/.../factor.c).
> 
> Is the default cost model really that bad except for the scalarization (and insert/extract) costs?

I don't really know what else is wrong at this point.  In some of these
cases it looked like somebody opened a file and vomited instructions
into it. ;)  These were the things that stood out immediately as Not a
Good Idea.

The basic concerns are:
 - Insert/extract on Altivec requires going through memory; and
 - Any scalarization that requires use of integer instructions also
requires going through memory, because there is no direct path between
the fixed-point and floating-point units.

For the latter, the issues I've noticed are VSELECT and SREM/UREM.  This
is likely not an exhaustive list.

In both of these cases, the cost of going through memory is not only the
additional instructions, but also the cost of the reject for the load
that follows the store.
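[Editor's sketch of the full round trip described here: store the vector, reload each element into the fixed-point unit, operate, store each result back, reload the vector, and charge a reject penalty for every load that hits a just-written doubleword. Every constant is an assumption, not a measured value.]

```cpp
#include <cassert>

// Assumed stall cost charged per load that hits a recent store.
constexpr unsigned RejectPenalty = 8;

// Hypothetical cost of scalarizing an N-element vector integer op whose
// scalar form takes OpCost instructions per element.
unsigned scalarizedOpCost(unsigned N, unsigned OpCost) {
  unsigned Insns = 1            // vector store
                 + N            // scalar loads
                 + N * OpCost   // scalar operations
                 + N            // scalar stores
                 + 1;           // vector reload
  // Every reload (N scalar loads plus the final vector reload) hits a
  // freshly written doubleword, so each pays the reject penalty.
  unsigned Rejects = (N + 1) * RejectPenalty;
  return Insns + Rejects;
}
```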

I agree that just returning a huge number isn't the best long-term
solution, but I wanted to at least get something out there to speed up
the test runs.  That will help with efforts to debug our LNT buildbot
problems.

Even with these basic constraints in place, I'm seeing some benchmarks
running ~60x slower than GCC, and my guess is this reflects additional
vectorization cases.

> 
> > 
> > I'm open to trying to do a little better for these particular cases,
> > but
> > I personally don't have the time to spend to put together a real cost
> > model for all instructions.  I'm just trying to patch up the spurting
> > artery I'm seeing for these few ugly scenarios.
> 
> Okay.

That sounded worse than I meant it to.  What I mean is that my current
focus is still on ensuring all essential features are present in the
back end and ABI-compatible.  Fixing performance issues is something I
don't have a lot of time for until that's done.  This was egregious
enough that tests were failing by timing out, so I felt I should at
least try to get a band-aid on it, but I would like to see someone with
more interest in autovectorization working on the long-term solutions.

> 
> > 
> > I have commitments this evening, but will look at Nadem's suggestions
> > tomorrow.

Ack, sorry Nadav, didn't mean to conflate your first and last names. ;)

> 
> I think what Nadav was talking about is this:

> 
> unsigned BasicTTI::getScalarizationOverhead(Type *Ty, bool Insert,
>                                             bool Extract) const {
>   assert (Ty->isVectorTy() && "Can only scalarize vectors");
>   unsigned Cost = 0;
> 
>   for (int i = 0, e = Ty->getVectorNumElements(); i < e; ++i) {
>     if (Insert)
>       Cost += TopTTI->getVectorInstrCost(Instruction::InsertElement, Ty, i);
>     if (Extract)
>       Cost += TopTTI->getVectorInstrCost(Instruction::ExtractElement, Ty, i);
>   }
> 
>   return Cost;
> }
> 
> We might need to make this overridable by the target as well? I'm not sure that adding the costs in this way will, by itself, capture the load-after-store hazards that you described.
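[Editor's sketch of the kind of target override Hal raises. The structure mimics the quoted BasicTTI code, but this is a simplified stand-in, not the real LLVM interface, and the 10x factor is only Bill's ballpark from later in the thread.]

```cpp
#include <cassert>

// Toy model: a generic TTI whose per-element insert/extract cost a
// target can override. Not the real LLVM API -- illustration only.
struct GenericTTI {
  virtual ~GenericTTI() = default;

  // Generic targets: one instruction per element insert/extract.
  virtual unsigned getVectorInstrCost(bool /*IsInsert*/,
                                      unsigned /*Index*/) const {
    return 1;
  }

  // Mirrors the shape of the quoted BasicTTI::getScalarizationOverhead.
  unsigned getScalarizationOverhead(unsigned NumElts, bool Insert,
                                    bool Extract) const {
    unsigned Cost = 0;
    for (unsigned i = 0; i != NumElts; ++i) {
      if (Insert)
        Cost += getVectorInstrCost(true, i);
      if (Extract)
        Cost += getVectorInstrCost(false, i);
    }
    return Cost;
  }
};

// A PowerPC-like target inflates each element's trip through memory to
// capture the load-after-store hazard (10x per element is Bill's figure).
struct PPCLikeTTI : GenericTTI {
  unsigned getVectorInstrCost(bool IsInsert, unsigned Index) const override {
    return 10 * GenericTTI::getVectorInstrCost(IsInsert, Index);
  }
};
```

Because the per-element costs compose through `getScalarizationOverhead`, raising only the target hook may capture the hazard without overriding the loop itself, which seems to be Bill's point below.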

Hm.  I was adjusting the cost of ISD::INSERT_VECTOR_ELT vs.
Instruction::InsertElement, etc. -- apparently I'm not doing it right?
In any case, setting those individual costs high should be adequate
(perhaps a cost of 10 per element or so).

How does this fit in with scalarizing a UREM, for example?  Each element
has the insert and extract cost along with about 7 scalar instructions
to calculate the remainder (assuming the setup code for the magic number
can be factored out of the loop).  How do I ensure that cost is
accounted for?
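[Editor's worked tally of the question above for a <4 x i32> urem. Only the ~7-instruction remainder estimate comes from the thread; the per-element extract/insert figures are hypothetical placeholders.]

```cpp
#include <cassert>

// Assumed per-element figures (placeholders), except RemInsns, which is
// Bill's estimate for a magic-number remainder with the setup code
// hoisted out of the loop.
constexpr unsigned N = 4;        // elements in a <4 x i32>
constexpr unsigned Extract = 2;  // assumed: vector store + scalar load
constexpr unsigned Insert = 3;   // assumed: scalar store + vector reload
constexpr unsigned RemInsns = 7; // magic-number remainder sequence

// Each element pays extract + remainder + insert.
constexpr unsigned UremCost = N * (Extract + RemInsns + Insert);
```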

Thanks,
Bill

> 
> Thanks again,
> Hal
> 
> > 
> > Thanks,
> > Bill
> > 
> > > 
> > > Thanks again,
> > > Hal
> > > 
> > > > +
> > > > +  // Vector element insert/extract with Altivec is very
> > > > expensive.
> > > > +  // Until VSX is available, avoid vectorizing loops that
> > > > require
> > > > +  // these operations.
> > > > +  if (Opcode == ISD::EXTRACT_VECTOR_ELT ||
> > > > +      Opcode == ISD::INSERT_VECTOR_ELT)
> > > > +    return Awful;
> > > > +
> > > > +  // We don't vectorize SREM/UREM so well.  Constrain the
> > > > vectorizer
> > > > +  // for those as well.
> > > > +  if (Opcode == ISD::SREM || Opcode == ISD::UREM)
> > > > +    return Awful;
> > > > +
> > > > +  // VSELECT is not yet implemented, leading to use of
> > > > insert/extract
> > > > +  // and ISEL, hence not a good idea.
> > > > +  if (Opcode == ISD::VSELECT)
> > > > +    return Awful;
> > > > +
> > > >    return TargetTransformInfo::getVectorInstrCost(Opcode, Val,
> > > >    Index);
> > > >  }
> > > >  
> > > > 
> > > > 
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > > 
> > > 
> > 
> > 
> 
