[llvm] r174660 - Constrain PowerPC autovectorization to fix bug 15041.

Fri Feb 8 08:58:11 PST 2013

----- Original Message -----
> From: "Bill Schmidt" <wschmidt at linux.vnet.ibm.com>
> To: "Nadav Rotem" <nrotem at apple.com>
> Cc: llvm-commits at cs.uiuc.edu, hfinkel at anl.gov
> Sent: Friday, February 8, 2013 8:59:13 AM
> Subject: Re: [llvm] r174660 - Constrain PowerPC autovectorization to fix bug 15041.
> 
> On Fri, 2013-02-08 at 07:20 -0600, Bill Schmidt wrote:
> > 
> > On Thu, 2013-02-07 at 12:52 -0800, Nadav Rotem wrote:
> > > Hi Bill,
> > > 
> > > 
> > > Returning a really high constant would prevent vectorization, but we
> > > can do better. If you look at the ARM and X86 backend you will see
> > > that we have code to estimate the 'scalarization' cost.  You can model
> > > the expensive transition of data from scalar to vector registers by
> > > assigning a high cost to the 'Insert/ExtractElement' instructions.
> > > This is important because in some loops we have perfectly vectorizable
> > > code with one 'scalarized' instruction. We still want to catch these
> > > cases. Additionally, the vectorizer is not the only user of the cost
> > > model. Some other transformations may want to estimate the cost of two
> > > alternatives, and in that case 'awful' is not a useful answer.
> > 
> > Thanks, Nadav!  Now that I'm using the correct opcode space, penalizing
> > just the scalarization at least solves the problem for paq8p.  I'll spot
> > check some of the other problems I saw, but hopefully this will kill the
> > worst offenders.
> 
> Attached is my current proposed patch.  Please let me know what you
> think.  This stops vectorization of the paq8p and factor cases; it's
> possible that the LHS penalty will need to be raised if we see other
> cases where scalarization is occurring and shouldn't be.  Thanks for all
> the help!

The penalty factor of 12 seems about right, but may need to be a little higher. To model the pipeline flush, I'd think that it should be essentially:
  (pipeline depth)*(ilp factor)
I can imagine this being ~6*2, but the P7 can actually have more than 2 in-flight instructions. Guessing from the diagram in the Sinharoy, et al. 2011 paper, I'd estimate that flushing all pipelines costs, at maximum, ~53 in-flight instructions. Of course, the average fill percentage is probably lower, but we might want to use a worst-case cost here.
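
To make the arithmetic concrete, here is a minimal standalone sketch of that estimate. It is not the attached patch and does not use LLVM's real types: the `Op` enum, the `PipelineModel` struct, and both function names are hypothetical, and the depth and ILP numbers are only the guesses from the paragraph above.

```cpp
#include <cassert>

// Hypothetical stand-ins for the opcodes discussed in this thread;
// these are not LLVM's ISD enumerators.
enum class Op { Add, Mul, ExtractElt, InsertElt };

// Assumed model parameters (illustrative guesses, not measured values).
struct PipelineModel {
  unsigned Depth;     // pipeline depth in stages
  unsigned IlpFactor; // instructions assumed in flight per stage
};

// Penalty for one scalar<->vector transition:
//   (pipeline depth) * (ilp factor)
inline unsigned flushPenalty(PipelineModel M) {
  assert(M.Depth > 0 && M.IlpFactor > 0 && "model parameters must be positive");
  return M.Depth * M.IlpFactor;
}

// Insert/extract pay the flush penalty on top of the base cost; every
// other opcode keeps its base cost, so the result stays a usable
// estimate rather than a blanket "awful" constant.
inline unsigned vectorInstrCost(Op Opcode, unsigned BaseCost, PipelineModel M) {
  if (Opcode == Op::ExtractElt || Opcode == Op::InsertElt)
    return BaseCost + flushPenalty(M);
  return BaseCost;
}
```

With `PipelineModel{6, 2}` the penalty comes out to 12, matching the factor in the patch; raising either parameter is how a worst-case estimate would be expressed in this sketch.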

LGTM.

Thanks again,
Hal

> 
> Bill
> 
> > 
> > Bill
> > > 
> > > 
> > > Thanks,
> > > Nadav
> > > 
> > > 
> > > On Feb 7, 2013, at 12:33 PM, Bill Schmidt
> > > <wschmidt at linux.vnet.ibm.com> wrote:
> > > 
> > > > +  const unsigned Awful = 1000;
> > > > +
> > > > > +  // Vector element insert/extract with Altivec is very expensive.
> > > > > +  // Until VSX is available, avoid vectorizing loops that require
> > > > > +  // these operations.
> > > > > +  if (Opcode == ISD::EXTRACT_VECTOR_ELT ||
> > > > > +      Opcode == ISD::INSERT_VECTOR_ELT)
> > > > > +    return Awful;
> > > > > +
> > > > > +  // We don't vectorize SREM/UREM so well.  Constrain the vectorizer
> > > > > +  // for those as well.
> > > > > +  if (Opcode == ISD::SREM || Opcode == ISD::UREM)
> > > > > +    return Awful;
> > > > > +
> > > > > +  // VSELECT is not yet implemented, leading to use of insert/extract
> > > > > +  // and ISEL, hence not a good idea.
> > > > > +  if (Opcode == ISD::VSELECT)
> > > > > +    return Awful;
> > > > > +
> > > > >   return TargetTransformInfo::getVectorInstrCost(Opcode, Val, Index);
> > > > }
> > > > 
> > > 
>