[llvm] r182023 - PPC32 cannot form counter loops around i64 FP conversions

Thu May 16 16:43:25 PDT 2013

----- Original Message -----
> Hal Finkel <hfinkel at anl.gov> wrote on 17.05.2013 00:50:08:
> 
> > > It seems to me that attempting to introduce this sort of "tight
> > > coupling" between an IR pass and a later MI pass will probably
> > > lead to problems as well.
> > >
> > > I'd instead suggest to have two self-contained passes that are
> > > only loosely coupled.  First, an IR pass recognizes likely CTR
> > > loops and rewrites them on the IR level into counting-down loops;
> > > that is a loop that uses regular IR to describe a counter being
> > > set to an initial value, counting it down, and testing it against
> > > zero as condition of the loop back-edge branch.  (This
> > > transformation as such can never lead to wrong code generation
> > > no matter what happens later.  In fact, I'd assume that there
> > > are already loop optimizers that perform exactly this type of
> > > transformation ...)
> > >
> > > Later on, an MI pass detects loops that look on the MI level like
> > > counting-down loops
> >
> > Unfortunately, this "looks like" gets difficult, and this is what
> > motivated me to attempt this on the IR level. In simple cases it is
> > fine, but, as I learned from the process of porting the current
> > Hexagon hardware loops pass, it is possible to fool the MI level
> > pass (or you need to make the MI level pass create silly-looking
> > code). Also, SE is much more powerful for recognising these kinds
> > of things.
> 
> Well, I guess my thought was that the MI pass would only even attempt
> to handle the "simple cases" and ignore everything else.  Basically,
> the pass would handle only code that already does "decrement vreg
> by one, branch if vreg nonzero".  No need to compute counts or
> anything on this level.
> 
> However, the IR pass would recognize complex cases and rewrite them
> - completely on the standard IR level - so that the rewritten loop
> would just happen to be one of the simple cases that the MI pass
> recognizes.

Right, and I think that's fine. The problem comes from code that already looks like the simple form, but isn't quite (because it lacks a guard, for example).
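
To make the missing-guard case concrete, here is a minimal C sketch. The function names are hypothetical, and the 'naive' formula is only an assumption about what a pass that presumes a guard might compute; it is not taken from the Hexagon or PPC code.

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics of: do { --i; } while (i > 0);
   Returns how many times the body actually runs. */
uint32_t do_while_iterations(int32_t i) {
    assert(i > INT32_MIN);  /* --i would overflow otherwise */
    uint32_t n = 0;
    do {
        --i;
        ++n;
    } while (i > 0);
    return n;
}

/* Hypothetical naive count: only valid if a guard already
   established i > 0 before the loop is entered. */
uint32_t naive_trip_count(int32_t i) {
    return (uint32_t)i;
}

/* Guard-free count: a do-while body always runs at least once,
   so clamp the count from below with a select. */
uint32_t safe_trip_count(int32_t i) {
    return i > 1 ? (uint32_t)i : 1u;
}
```

For i = 5 all three functions agree. For i = -3 the body runs once, but naive_trip_count yields 4294967293, which is the kind of huge count that would be loaded into the counter register. The select in safe_trip_count avoids that at the cost of an extra operation that is redundant whenever a guard already exists.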

> 
> > The problem that the current Hexagon pass has, for example, is that
> > for non-constant-trip-count loops, it assumes that there is a loop
> > guard. This means that there is some comparison that skips the loop
> > if the count is zero or, importantly, negative. However, do-while
> > loops have no guard, and so nothing prevents negative counts from
> > being calculated (which causes miscompiles). So a loop
> > do {--i} while (i > 0), which normally executes only the first
> > iteration, could be calculated to have a large (negative) count if
> > the 'naive' formula is used. Alternatively, we could always
> > generate the isel (or the
> > branch code on earlier processors), but that's silly when we have a
> > guard. And checking for the loop guard also gets hard quickly.
> > Because the loop guards are generated fairly early by the IR-level
> > loop simplification pass, the guard conditions get hoisted, combined
> > with other things, etc. (not that SE does not sometimes generate
> > redundant guards, but at least that is common high-level code that
> > can be improved).
> 
> Can you not do all that on the IR level, and if it detects a count,
> simply rewrite the IR to a simple counting loop?

The other problem is that, if you do this rewriting, you can end up with redundant induction variables when the later MI-level transformation does not happen.
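
As a rough illustration of the redundant induction variable, consider the following C sketch. The function names and the array-sum loop are hypothetical, chosen only to show the shape of the code before and after such a rewrite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original loop: a single induction variable i drives both the
   addressing and the exit test. */
int64_t sum_original(const int32_t *a, size_t n) {
    assert(a != NULL || n == 0);
    int64_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* After rewriting to count-down form: ctr controls the back-edge,
   but i is still needed for addressing. */
int64_t sum_counted(const int32_t *a, size_t n) {
    assert(a != NULL || n == 0);
    int64_t s = 0;
    if (n == 0)            /* loop guard */
        return s;
    size_t i = 0;
    size_t ctr = n;        /* would become mtctr */
    do {
        s += a[i];
        ++i;
    } while (--ctr != 0);  /* would become bdnz */
    return s;
}
```

If ctr ends up in the counter register, the second variable costs no GPR. If the loop is not converted after all, both i and ctr are live in GPRs, and since they are not the same expression they are unlikely to be merged.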

> 
> > I think that a robust solution has the IR-level generate count
> > expressions, and these count expressions are somehow explicitly tied
> > to the relevant backedge (using some intrinsic). Both the count and
> > this backedge are transformed into pseudo instructions, and a
> > cleanup pass either turns these things into the real mtctr/bdnz
> > instructions (and DCEs any now-dead induction variables and
> > compares), or DCEs the count and turns the compare into a regular
> > instruction.
> 
> I don't really like having the "count annotation" on the side,
> since it's just duplicate information and may get out of date
> with subsequent transformation.  If after computing the count,
> you don't use it for a side-band annotation, but instead just
> rewrite the IR using that count information, this problem
> wouldn't be there ...

Yes, I understand, and I agree with you. But I don't want to generate unnecessary guards (or duplicate induction variables), and so...

Maybe it would be better to do it the other way? What if we always generate the counter-based loops, and then undo the transformation if we detect, at the MI level, other clobbers of the counter register? We might end up generating suboptimal code in these cases (because we could end up with effectively two induction variables that are not exactly the same and so won't be CSE'd), but for loops with function calls, indirect jumps, etc., maybe the performance hit won't matter much. But I don't really believe that :(

In the end, this is why I decided to do this on the IR level. It is fragile to detect IR sequences that might become function calls, but as LLVM is currently constructed, and because the selection-DAG is basic-block local, there is a finite set of such things and we can audit the code generator for them. I don't *like* it, but it seems to be the only way that we can really generate the best code (without extraneous guard expressions, and without redundant GPR induction variables, etc.) in practice. We could certainly pair this with some kind of 'undo' mechanism just in case. Maybe the best solution is to leave things this way, but make a verification pass that checks for clobbers inside the loops after the fact. This would make the compiler ICE instead of miscompiling, which is preferable (and we could turn it off in non-asserts builds).
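
A minimal sketch of what such an after-the-fact check might look like, written in C against a toy instruction model. The Opcode enum, clobbers_ctr, and ctr_loop_is_clean are hypothetical names and not LLVM APIs. The sketch assumes that calls and mtctr are the instructions that can change the counter register, and that a bdnz anywhere but at the end of the body is also a violation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy opcode set; a stand-in for real machine instructions. */
typedef enum {
    OP_ADD, OP_LOAD, OP_STORE, OP_CALL, OP_MTCTR, OP_BDNZ
} Opcode;

/* Assumption: a call may clobber CTR in the callee, and mtctr
   (also used to set up indirect jumps) writes it directly. */
static bool clobbers_ctr(Opcode op) {
    return op == OP_CALL || op == OP_MTCTR;
}

/* Returns true if the loop body ends in bdnz and no earlier
   instruction in the body touches the counter register. */
bool ctr_loop_is_clean(const Opcode *body, size_t len) {
    assert(body != NULL && len > 0);
    if (body[len - 1] != OP_BDNZ)
        return false;
    for (size_t k = 0; k + 1 < len; ++k) {
        if (clobbers_ctr(body[k]) || body[k] == OP_BDNZ)
            return false;
    }
    return true;
}
```

In an asserts build, the pass would wrap the check in an assertion, e.g. assert(ctr_loop_is_clean(body, len)), so that a missed clobber shows up as an ICE rather than as wrong code.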

Thanks again,
Hal

> 
> Bye,
> Ulrich
> 
>