[llvm-dev] Writing loop transformations on the right representation is more productive

Fri Jan 10 14:09:57 PST 2020

On Fri, 3 Jan 2020 at 11:27, Michael Kruse via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> 1. Candidate selection through cost function
> --------------------------------------------
> Instead of needing to know which transformation is profitable before
> applying it, create a copy of the data structure, modify it, and
> compare it to the original. Moreover, holding multiple, differently
> optimized, copies allows evaluating each variant using a cost function
> and select the 'best' when re-generating LLVM-IR again (or re-use the
> original LLVM-IR).

This sounds a lot like VPlan.

> Instantiating every possible sequence of transformations of course is
> not feasible, so the search space needs to be pruned. This could be
> made dependent on the optimization level.

Are you planning on using heuristic searches? This could make the
vectoriser unstable upon small input changes and therefore hard to get
consistent results and testing.

I'm not against such idea, but I think we need to be conservative in
such a core component of the compiler.

It would be nice to have -Ogocrazy to mean "keep going until you find
something", but usually, -O3 should terminate. :)

> 2. Cheap copies using Red/Green Trees

This seems like an elegant approach to a complex problem.

> 3. Apply transformations from high-level to low-level
> -----------------------------------------------------
> Optimization should be applied from very specialized to very general
> (potentially after some canonicalization). For instance, the first
> step could be detecting common idioms such as gemm and replace them
> with either a BLAS function call or apply well-studied optimizations
> like BLIS to them. After such an idiom has been detected, no other
> transformation should be applied to them.

I'm sceptical to such a machinery. People usually write bad code (me
included) and trying to mach multiple patterns to the same semantics
will be hard, considering how lenient C++ is to pointer handling and
type conversions.

If you do find a match and convert to a library function call, then
well, you can't do anything with it even if you wanted to. :)

> Mid-level transformations may try to map entire loop nests to cache-
> and compute hierarchies (SIMT threads, multiprocessors, offloading,
> etc) by applying transformations such as tiling, loop interchange and
> array packing.

This is hard but very profitable. However, feels to me again that this
is just VPlan packed differently.

While VPlan still has no way yet to handle even simple outer-loops
(has that landed yet?), once we do, then the natural progression will
be to start understanding their semantics and possibly make high level
assumptions like that.

> 6. Late fallback versioning at IR regeneration
> ------------------------------------------
> When a transformation is applied, it can attach conditions (no
> aliasing, no integer wrap, value restrictions, etc.) under which the
> transformation is valid. During LLVM-IR generation, these conditions
> are collected and emitted as run-time conditions. If the condition
> fails, the original code is executed.

This sounds like it will bloat code for a lot of cold cases. Or worse,
get it wrong, and put hot code in the cold path.

> 7. Standardized API for analyses with multiple implementations

These are good to have regardless of which vectorisation strategy we use.

> 8. Abstract representation of statements
> ----------------------------------------
> For instance, assuming that %add is not used a second time, in
> the example below
>
>     %add = add i64 %arg, 2
>     %mul = shl i64 %add, 1
>
> the two instructions should always be computed together in the same
> loop iteration.

This may inhibit further combines, or even detection of target
specific patterns for SIMD code that aren't common.

I agree that not forcefully binding statements with instructions is a
good idea, but this may need a target-specific pattern matcher to be
more sensitive to target idiosyncrasies.

> 9. Expansion of use-def chains to arrays when spanning loops
> ------------------------------------------------------------
> The transforming pass has to consider this during its profitability
> model. The big advantage is that in terms of correctness, use-def
> chains do not manifest false dependencies.

Sounds good, but also creates the problem of how to handle the array.
If 'n' is unknown, or dependent on SIMD widths or number of threads,
it's too low level to add anything that is guaranteed to not change
the performance profile of the original loop.

> Q: Relation to the new/legacy pass manager?
> A: LLVM's pass managers are unfortunately not designed to apply to
> subtrees nor persistent data structures besides the LLVM-IR itself.

By design. The more alternative persistent data structures you have
being modified by a series of passes, the harder it is to know what
did what and where you are.

> Instead, the loop tree optimizer would be its own monolithic pass on
> the pass manager level (like MachinePassManager and VPlan). My idea is
> to add it somewhere before LoopVectorize, but after the inliner,
> potentially replace most other loop transformations.

To me this almost sounds like Polly. Take LLVM IR into a completely
different representation, do a bunch of transformations with it,
re-generate LLVM IR and spits it back into the pipeline.

By that time, all analyses have to be invalidated. All
canonicalisations that had been done will probably be destroyed and
many current pattern matches will stop working. This infrastructure is
only meaningful at the function level or higher, so the potential for
wide range destruction is not trivial.

Don't get me wrong, I like the idea, it's a cool experiment using some
cool data structures and algorithms. But previous experiences with the
pass manager have, well, not gone smooth in any shape or form.

> Q: Relation to LoopVectorize/VPlan?
> A: VPlan has similar design goals [9] but is intended for
> vectorization only.

Again, by a conservative design. I think adding yet-another won't help.

My point is: if this is the way to go, then we should start to think
how we make everything that makes sense become part of this scheme.
Splitting the pass manager into SSA and Tree, run some passes in one
others in the other, and so on.

But creating multiple, incompatible and potentially destructive whole
new pass managers will make a hard job impossible.

> However, it lacks cheap copies. Instead
> of instructions, it uses recipes/"meta-instructions" that handle what
> happens to instructions after vectorization, e.g. do that operation on
> each vector lane ("WIDEN").

Nothing stops us from implementing a leaner approach to VPlan. It
wouldn't be a trivial implementation, but the volume of work that
would be required in this proposal is staggering, too.

> VPlan is more oriented towards modifying
> instructions instead of statements as collection of instructions.

Fair enough, the design was to enhance SIMD code generation, not any
kind of parallel semantics. I guess it would be possible to add the
concept of higher level blocks to VPlan.

All in all, VPlan is young and in constant refactoring, and perhaps it
would be more productive to move it towards a more inclusive approach
than throwing it away before it fully matures to start a whole new
project.

https://xkcd.com/927/

> Q: Relation to MLIR?
> A: MLIR is more similar to LLVM-IR than a loop hierarchy. For
> instance, it also does not feature cheap copies.

If you treat MLIR as your red tree, you could create a green tree
(perhaps as a dialect) and have cheap copies (passing the dialects and
deltas without passing the base).

> An advantage is that
> loops and multi-dimensional arrays can be represented in the language
> without the need of being rediscovered, but have to be inserted by a
> front-end.

Not necessarily. We have discussed introducing dialect annotation to
MLIR during compile time from analysis passes that would basically do
what the front-end should have done.

Conclusions?

This was a long email, with too many proposals, so I don't have any
strong opinion or conclusions, not even from my own comments.

Overall, I like a lot of the ideas (red/green, tree optimisation,
different search strategy), but I dislike the encompassing proposal to
*replace* a lot of the existing infrastructure.

For better or worse, LLVM is a product of its age. Some things could
have been done better, but we have always adopted the "general
consensus and slow movement" way to change things. Sometimes too slow,
but...

Now, MLIR can be a way to speed that up.

It is a much more malleable format than LLVM IR, it was designed for
high-level representation, has a lot of parallelism concepts in it and
it's supposed to interact with LLVM IR seamlessly.

It may be much easier to use MLIR to interoperate the two "pass
managers" _together_ than converting from one to the other and vice
versa.

This is a bold claim and I have no evidence it could ever work. But I
think it would still be less work than creating yet another pass
manager from scratch.

cheers,
--renato