[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution

Sun Aug 16 21:10:43 PDT 2015

----- Original Message -----
> From: "Sanjoy Das" <sanjoy at playingwithpointers.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Andrew Trick" <atrick at apple.com>, "Quentin Colombet"
> <qcolombet at apple.com>, "Dan Gohman" <dan433584 at gmail.com>
> Sent: Sunday, August 16, 2015 9:43:54 PM
> Subject: Re: [llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution
> 
> > I don't understand why you want to factor out the information,
> > exactly. It seems like what you need is a function like:
> >
> >   unsigned getMinLeadingZeros(const SCEV *);
> >
> > then, if you want to get the non-extended expression, you can just
> > apply an appropriate truncation. I assume, however, that I'm
> > missing
> > something.
> 
> The problem is not about how to codegen a specific SCEV expression
> efficiently, but about the overall solution LSR converges to (which
> includes considering how it will codegen *other* SCEVs).  In the
> latter sense there is a difference between `X` and `(trunc (zext X))`
> even though both of them produce the same value (and possibly are
> equally efficient).
> 
> The example in [1] is roughly like this at the high level:
> 
> You have an i32 induction variable `V` == `{VStart,+,1}`.
> `{VStart,+,1}` does not unsigned-overflow given the loop bounds.  Two
> interesting uses in the loop are something like (in this mail by zext
> I'll always mean zext from i32 to i64) `GEP @Global, zext(V)` and
> `V ult limit`.
> 
> Because SCEV can prove that `{VStart,+,1}` does not
> unsigned-overflow,
> `GEP @Global, zext(V)` can be computed as `GEP (@Global + zext
> VStart), {i64 0,+,1}`.  Similarly, `V` == `trunc (zext V)` ==
> `trunc {zext VStart,+,1}` == `trunc {zext VStart,+,1}` ==
> `trunc({i64 0,+,1}) + trunc (zext VStart)` ==
> `trunc({i64 0,+,1}) + VStart`.  This is the solution LSR ends up
> choosing.  However, the cheaper solution here is
> 
> `GEP @Global, zext(V)` == `GEP @Global, zext({VStart,+,1})`
> and
> `V ult limit` == `{VStart,+,1} ult limit`
> 
> (i.e. pretty much the direct naive lowering).
> 
> This naive lowering is better because we can use the increment in
> `{VStart,+,1}` to zero extend the value implicitly on x86 (you can
> find the generated machine code at [1]).
> 
> The reason why LSR does not pick this cheaper solution is because it
> "forgets" that `{zext VStart,+,1}` == `zext({VStart,+,1})`.  I'm
> trying to fix this by adding a rule that replaces `{zext X,+,zext Y}`
> with `zext({X,+,Y})` when possible on architectures that have "free"
> zero extensions, so that if there is a solution involving the
> narrower
> induction variable, it will be found.

To back up for a second, how much of this is self-inflicted damage? IndVarSimplify likes to preemptively widen induction variables. Is that why you have the extensions here in the first place?

The code model used by IndVarSimplify is a bit simple:

  // We should not widen an indvar if arithmetics on the wider indvar are more
  // expensive than those on the narrower indvar. We check only the cost of ADD
  // because at least an ADD is required to increment the induction variable. We
  // could compute more comprehensively the cost of all instructions on the
  // induction variable when necessary.
  if (TTI &&
      TTI->getArithmeticInstrCost(Instruction::Add, Ty) >
          TTI->getArithmeticInstrCost(Instruction::Add,
                                      Cast->getOperand(0)->getType())) {

and perhaps we need a bit more sophistication here. I've found that, generally speaking, the extensions on inductions variables, and SCEV's preference for factoring them out to varying degrees, ends up making SCEV's job more difficult.

Also, Clang likes to generate code that ends up looking like this (after some optimization):

  %idxprom = sext i32 %i.0 to i64
  %arrayidx = getelementptr inbounds i32, i32* %b, i64 %idxprom

but even here, this is actually unnecessary. An inbounds GEP is defined to use sign-extended arithmetic, and so this could just as easily be:

  %arrayidx = getelementptr inbounds i32, i32* %b, i32 %i.0

In short, I wonder if we're solving the right problem. Maybe the canonicalization of the IR is not as useful as it could be.

 -Hal

> 
> Now I could use the trick you mentioned in the place in LSR where it
> computes the cost of a solution (i.e. when looking for "common"
> registers see if one register is the zero extension of another).  I
> have not tried it, but I suspect it will be fairly tricky -- a given
> solution will now have more than one "cost" depending on how you
> interpret specific registers (as truncations of some extended value
> or
> as a native wide value) and you'd have to search over the possible
> interpretations of a specific solution.
> 
> > Adding simplification barriers into SCEV expressions seems like a
> > dangerous idea, especially since LSR is not the only user of SCEV.
> 
> So I don't intend for SCEV to produce SCEVOpaqueExpr nodes itself.
> There would instead be a 'const SCEV *getOpaqueExpr(const SCEV *)'
> function that clients of SCEV can use to create an opaque expression.
> Then most of the simplification routines SCEV will be changed to have
> the moral equivalent of 'if (isa<SCEVOpaqueExpr>(E)) continue;'.
> 
> -- Sanjoy
> 
> [1]: http://lists.llvm.org/pipermail/llvm-dev/2015-April/084434.html
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory