[cfe-dev] RFC: User-directed code transformations with #pragma clang transform

Tue Dec 17 22:44:39 PST 2019

Dear Clang community,

I am working on adding more powerful compiler optimization directives
that allows the programmer to steer optimization decisions in mid-end.
Most recently presented this work at the LLVM Dev Meeting [2] and I am
looking for reviewers https://reviews.llvm.org/D69088.

------------------------------------------------------------------------------------------------------------
The motivation

is the same as for "#pragma clang loop" and "#pragma omp simd":

 * Separate semantics and optimizations

 * Better maintainability by lowering the code complexity

 * Only switch the pragmas (e.g. using the preprocessor or #pragma omp
metadirective) to optimize the same code base for different targets

 * Easily try out different optimizations without needing to rewrite
the source base

 * Or use an autotuner which inserts #pragma clang transform
directives to find the transformation sequence that has the best
performance.

Such directives already exist in the form of "#pragma clang loop" (aka
loop hints) and "#pragma omp simd", but have several problems:

 * OpenMP is currently limited to Vectorization

 * Loop hints and OpenMP are incompatible to each other

 * Multiple loops hints can be applied to the same loop, but the
transformation order cannot be modified

 * There are no front-end diagnostics when a transformation does not
make sense, e.g. unroll-and-jam without an inner loop

 * Hints are emitted in form of metadata (MDNode) that can be dropped
by mid-end optimizers

 * Loop hints have been added ad-hoc and have inconsistent syntax and
semantics (e.g. whether and how vectorize_width/unroll_count
implicitly enables vectorization/unrolling)

I am also currently working in the OpenMP language committee to add
more loop transformations with defined transformation order that are
compatible with the other OpenMP constructs. So far, tiling has been
added to the TR8 [0] and I hope to also add unrolling to OpenMP 5.1,
with more transformations to follow in OpenMP 6.0. For more details on
the motivation, also see references [2-5].

------------------------------------------------------------------------------------------------------------
The suggested syntax is the following:

    #pragma clang transform unroll [full/partial(n)]
    #pragma clang transform unrollandjam [full/partial(n)]
    #pragma clang transform distribute
    #pragma clang transform vectorize [width(n)]
    #pragma clang transform interleave [factor(n)]

The selection is currently limited to the passes LLVM currently
supports. I am working on more transformations that currently are only
picked-up by Polly. The largest difference to loop hints is that it
allows to specify in which order the transformations are applied,
which is ignored by clang's current LoopHint attribute. That is, the
following for reverses the loop, then unrolls it.

    #pragma clang transform unroll partial(2)
    #pragma clang transform reverse
    for (int i = 0; i < 128; i+=1)
      body(i);

Whereas the following first unrolls the loop, then reverses it:

    #pragma clang transform reverse
    #pragma clang transform unroll partial(2)
    for (int i = 0; i < 128; i+=1)
      body(i);

Observe that the results are different: The former will execute the
loop body in the order

    body(127); body(126); body(125); body(124); ...

whereas the latter should execute

    body(126); body(127); body(124); body(125); ...

If during optimizing performance the programmer reaches the point for
force the compiler to use specific transformations, they will not want
to leave the choice in which order to apply them to the compiler as
their performance characteristics can be quite different.

Furthermore, I intend to implement assigning identifiers to loop to
reference them in followup transformations (e.g. tile a loop,
parallelize the generated outer loop and vectorize the inner), make
then combinable with OpenMP directives, and add options to control
whether to only override the compiler's profitability heuristic or
also the legality check (as "#pragma clang loop
vectorize(assume_safety)" does). Of course, I would like to implement
more transformations  (tiling, fusion, interchange, array packing,
reversal, wavefronting, peeling, splitting, space-filling curves,
unswitching, collapsing, strip/strip-mining, blocking, ...) and
options.

There will always be differences between the OpenMP directives and
"#pragma clang transform": First, we cannot implement clang-specific
non-OpenMP transformations using the OpenMP syntax. Second, OpenMP
follows an "the user is always right" approach without checking
whether the program semantics are preserved. By contrast, loop hints
historically and by default do not change the code's semantics but
take the choice of which transformation out of the legal ones to use.
This, for instance, is important for autotuning to avoid that the
machine learning algorithm chooses an optimization that yields wrong
results.

------------------------------------------------------------------------------------------------------------
This is a followup to a previous RFC [1] Unfortunately, I did not get
any response directly on the mailing list (However, I got responses
off the mailing list). Since then, I uploaded diffs for reviews (see
below). Some reviewers requested an updated RFC [Lex] on the topic
which this is intended to be.

So far, I uploaded the following patches:

* [Lex] https://reviews.llvm.org/D69088
  Also adds an -fexperimental-transform-pragma switch to enable
processing of #pragma clang transform

* [Parser] https://reviews.llvm.org/D69089
  Also adds classes to represent a transformation such as LoopUnrollTransform

 * [Sema] https://reviews.llvm.org/D69091
   Adds the AST nodes kinds: the Stmt representing the #pragma
(TransformExecutableDirective) and the clauses (TransformClause).
Moreover, the AnalysisTransform component constructs a loop nest tree
to which transformations are applied to such that Sema can warn about
inconsistencies, e.g. there is no inner or ambiguous loops for
unrollandjam.

 * [CodeGen] https://reviews.llvm.org/D69092
   The codegen part uses the same AnalysisTransform to determine which
loop metadata nodes to emit.

 * [Serialization] https://reviews.llvm.org/D70572

 * [CIndex] https://reviews.llvm.org/D71447:

 * [Docs] https://reviews.llvm.org/D70032:

The main change since the previous RFC is the syntax: In [1] I
suggested, as did some reviewers in [Lex], to re-use the "#pragma
clang loop" syntax. However, I am now convinced that is is not a good
idea because syntax as well as semantics of #pragma clang loop are
different. Ensuring compatibility with existing code requires quirky
exceptions. Also, I eventually would also like to add transformations
that do apply on code that are not on loops (such as code motion,
task-parallelism, SLP vectorization, etc).

Previously, I used an implementation based on attributes [7] (also
used by #pragma clang loop) but I don't think these are powerful
enough to meet all requirements. For instance, transformations that
apply to loop identifiers for generated loops do not have a
syntactical loop that they could be attached to.

------------------------------------------------------------------------------------------------------------
Applying the transformations

My patches for the front-end part do not apply the transformations,
but add metadata to inform LLVM passes to apply the transformations.
[8] and [9] are about encoding of the transformation order as loop
metadata. Patch [9] has been applied some time ago.

However, the applicability of transformations in LLVM is limited by
the passes that have been implemented and the order in which they are
processed by the pass manager. The new pass manager has an updater
that allows re-processing of loops if a pass applied a transformation,
but the pipeline has to be restructured to allow this. I am using an
extension to Polly [12] to apply transformations in any order since
Polly makes it a lot easier to implement a transformation compared to
implement on from scratch on the IR level.

To not require Polly to implement OpenMP, an alternative to emitting
metadata is to use the OpenMPIRBuilder [13] to directly generate the
transformed IR. However, we will not want to do safety analysis on
this level such that this implementation is limited to OpenMP and
assume_safety transformations.

In the long run, I would like to introduce a framework for more
general transformations [10-11] for which we are currently preparing
another RFC.

With kind regards,
Michael Kruse

[0] OpenMP TR8
https://www.openmp.org/wp-content/uploads/openmp-TR8.pdf

[1] RFC: Extending #pragma clang loop
https://lists.llvm.org/pipermail/cfe-dev/2018-May/058141.html

[2] Loop-transformation #pragmas in the front-end
http://llvm.org/devmtg/2019-10/talk-abstracts.html#tech20

[3] A Proposal for Loop-Transformation Pragmas (IWOMP'19)
https://arxiv.org/abs/1805.03374

[4] User-Directed Loop-Transformations in Clang (LLVM-HPC'18)
https://arxiv.org/abs/1811.00624

[5] Design and Use of Loop-Transformation Pragmas (IWOMP'19)
https://arxiv.org/abs/1910.02375

[6] Prototype implementation using #pragma clang loop and attributes
https://github.com/SOLLVE/llvm-project/tree/pragma-clang-loop

[7] Implementation uploaded for review
https://github.com/SOLLVE/llvm-project/tree/pragma-clang-transform

[8] RFC: Extending loop metadata
https://lists.llvm.org/pipermail/llvm-dev/2018-May/123690.html

[9] D57978: Metadata for follow-up transformations
https://reviews.llvm.org/D57978

[10] Loop Optimization Framework (LCPC'18)
https://arxiv.org/abs/1811.00632

[11] Loop Transformations in LLVM (LLVM DevMtg'18)
https://youtu.be/QpvZt9w-Jik?t=813

[12] Implementation for applying transformations using Polly
https://github.com/SOLLVE/polly/tree/pragma

[13] OpenMPIRBuilder
https://reviews.llvm.org/D69785