[llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution

Mon Aug 17 17:35:49 PDT 2015

On Sun, Aug 16, 2015 at 7:43 PM, Sanjoy Das via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> The problem is not about how to codegen a specific SCEV expression
> efficiently, but about the overall solution LSR converges to (which
> includes considering how it will codegen *other* SCEVs).  In the
> latter sense there is a difference between `X` and `(trunc (zext X))`
> even though both of them produce the same value (and possibly are
> equally efficient).
>
> The example in [1] is roughly like this at the high level:
>
> You have an i32 induction variable `V` == `{VStart,+,1}`.
> `{VStart,+,1}` does not unsigned-overflow given the loop bounds.  Two
> interesting uses in the loop are something like (in this mail by zext
> I'll always mean zext from i32 to i64) `GEP @Global, zext(V)` and
> `V ult limit`.
>
> Because SCEV can prove that `{VStart,+,1}` does not unsigned-overflow,
> `GEP @Global, zext(V)` can be computed as `GEP (@Global + zext
> VStart), {i64 0,+,1}`.  Similarly, `V` == `trunc (zext V)` ==
> `trunc {zext VStart,+,1}` == `trunc {zext VStart,+,1}` ==
> `trunc({i64 0,+,1}) + trunc (zext VStart)` ==
> `trunc({i64 0,+,1}) + VStart`.  This is the solution LSR ends up
> choosing.  However, the cheaper solution here is
>
> `GEP @Global, zext(V)` == `GEP @Global, zext({VStart,+,1})`
> and
> `V ult limit` == `{VStart,+,1} ult limit`
>
> (i.e. pretty much the direct naive lowering).
>
> This naive lowering is better because we can use the increment in
> `{VStart,+,1}` to zero extend the value implicitly on x86 (you can
> find the generated machine code at [1]).
>
> The reason why LSR does not pick this cheaper solution is because it
> "forgets" that `{zext VStart,+,1}` == `zext({VStart,+,1})`.  I'm
> trying to fix this by adding a rule that replaces `{zext X,+,zext Y}`
> with `zext({X,+,Y})` when possible on architectures that have "free"
> zero extensions, so that if there is a solution involving the narrower
> induction variable, it will be found.
>
> Now I could use the trick you mentioned in the place in LSR where it
> computes the cost of a solution (i.e. when looking for "common"
> registers see if one register is the zero extension of another).  I
> have not tried it, but I suspect it will be fairly tricky -- a given
> solution will now have more than one "cost" depending on how you
> interpret specific registers (as truncations of some extended value or
> as a native wide value) and you'd have to search over the possible
> interpretations of a specific solution.
>
>> Adding simplification barriers into SCEV expressions seems like a
>> dangerous idea, especially since LSR is not the only user of SCEV.
>
> So I don't intend for SCEV to produce SCEVOpaqueExpr nodes itself.
> There would instead be a 'const SCEV *getOpaqueExpr(const SCEV *)'
> function that clients of SCEV can use to create an opaque expression.
> Then most of the simplification routines SCEV will be changed to have
> the moral equivalent of 'if (isa<SCEVOpaqueExpr>(E)) continue;'.

Hi Sanjoy - I also noticed LSR was converging on suboptimal solutions.
I'm hoping you can clarify this discussion for me.
Given a simple copy loop:

void copy(int* src, int* dst, unsigned int size) {
    for (unsigned int i = 0; i < size; ++i)
        dst[i] = src[i];
}

The hot spot for i686 contains 6 instructions with LSR:

8b 32          mov    (%edx),%esi
89 31          mov    %esi,(%ecx)
83 c1 04     add    $0x4,%ecx
83 c2 04     add    $0x4,%edx
48             dec    %eax
75 f3          jne    11 <copy+0x11>

but only 5 instructions without LSR.

8b 3c b2      mov    (%edx,%esi,4),%edi
89 3c b1      mov    %edi,(%ecx,%esi,4)
46               inc    %esi
39 c6          cmp    %eax,%esi
75 f5           jne    14 <copy+0x14>

My interpretation of this problem was that LSR starts analysis with an
SCEV that counts down toward zero even in an upward counting loop.  A
more natural fit would be an initial IV going from 0 to some limit,
but there does not seem to be a way to express this sort of SCEV.  The
initial condition puts an artificially high cost on solutions with
upward counting registers since none match the "free" down counting
SCEV.  Notice the weird 'dec %eax' showing up as an artifact.  The
best solution with base-index scale 4 addressing was considered by
LSR, but incorrectly deemed to be too expensive.

Is this the same LSR problem you describe?

Thanks,
-steve