[cfe-dev] RFC: Extending #pragma clang loop

Wed May 30 19:54:03 PDT 2018

Hi cfe-dev,

we are working on enhancing LLVM's loop-optimization infrastructure,
which will include adding more loop transformations, and ran into some
problems with the way Clang currently handles loop hints [5,6].
We hope to get feedback from the community for the proposal below,
particularly about upstreaming such changes to the Clang repository.
Our vision is to make most loop optimizations expressible using
pragmas. For example, a matrix-multiplication kernel with
comparable performance to hand-optimized BLAS libraries could be
written as:

    for (int i = 0; i < M; i+=1)
      for (int j = 0; j < N; j+=1) {
        #pragma clang section id(zero)
        { C[i][j] = 0; }
        for (int k = 0; k < K; k+=1)
          C[i][j] += A[i][k] * B[k][j];
      }

    #pragma clang loop(i,j) distribute sections(zero,k)
    #pragma clang loop(i,j,k) tile sizes(131,23,21) \
        pit_ids(ip,jp,kp) tile_ids(it,jt,kt)
    #pragma clang loop(ip,..,jt) interchange \
        permutation(jp,kp,ip,jt,it)
    #pragma clang loop(it,kt) pack array(A)
    #pragma clang loop(jt,kt) pack array(B)
    #pragma clang loop(it) vectorize

Any transformation might also be applied by a pass heuristic even if
not annotated. For instance, LoopVectorizer would vectorize even
without the pragma, but depending on the VectorizationPlan's cost
model, it might vectorize the innermost loop instead.
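
To make the first pragma of the example concrete, here is a minimal
sketch of what distributing the (i,j) loops over the sections "zero"
and "k" could look like when done by hand. It assumes that
distribution gives each section its own copy of the enclosing loops;
the function names, the small matrix sizes and the self-check are
illustrative only and are not part of the proposal.

```c
#include <assert.h>

enum { M = 3, N = 4, K = 5 };

/* The nest as written in the example: both sections share the (i, j) loops. */
void matmul_fused(double C[M][N], double A[M][K], double B[K][N]) {
  for (int i = 0; i < M; i += 1)
    for (int j = 0; j < N; j += 1) {
      C[i][j] = 0;                   /* section "zero" */
      for (int k = 0; k < K; k += 1) /* loop "k" */
        C[i][j] += A[i][k] * B[k][j];
    }
}

/* Distributed by hand: each section gets its own copy of the (i, j) loops. */
void matmul_distributed(double C[M][N], double A[M][K], double B[K][N]) {
  for (int i = 0; i < M; i += 1)
    for (int j = 0; j < N; j += 1)
      C[i][j] = 0;
  for (int i = 0; i < M; i += 1)
    for (int j = 0; j < N; j += 1)
      for (int k = 0; k < K; k += 1)
        C[i][j] += A[i][k] * B[k][j];
}

/* Self-check: both variants must produce the same C for the same inputs. */
void check_distribution(void) {
  double A[M][K], B[K][N], C1[M][N], C2[M][N];
  for (int i = 0; i < M; i += 1)
    for (int k = 0; k < K; k += 1)
      A[i][k] = i + k;
  for (int k = 0; k < K; k += 1)
    for (int j = 0; j < N; j += 1)
      B[k][j] = k - j;
  matmul_fused(C1, A, B);
  matmul_distributed(C2, A, B);
  for (int i = 0; i < M; i += 1)
    for (int j = 0; j < N; j += 1)
      assert(C1[i][j] == C2[i][j]);
}
```

After distribution, the second nest is a perfect loop nest over
(i,j,k), which is the shape the subsequent tile pragma refers to.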

In short, the proposal consists of the following parts:

1. Define an order in which multiple loop transformations are applied

2. Make it possible to assign names to loops and refer to them in
transformations

3. Change the general syntax to one that is better suited for
transformations with more than one option

4. Add more loop transformations

5. An implementation in Clang

6. Propose such an extension to the OpenMP standard

In more detail:

1. Currently, when multiple transformations are specified for the same
loop, they are ordered as defined by the ordering of their passes in
the pass manager. IMHO this leaks an implementation detail into the
language.

IMHO, a loop transformation (that does not refer to a named loop to
apply to) should apply to either the following for loop in the
source or the loop that results from the following loop
transformation, whichever applies. That is, loop transformations
"stack up".
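
A minimal sketch of the current situation, using only hint options
that Clang already accepts (unroll_count and vectorize_width). The two
functions differ only in the source order of the hints; with the
pass-manager-defined ordering described above, that source order is
not what decides which transformation runs first. The function names
are hypothetical, and the #ifdef guard is only there so that other
compilers do not see the Clang-specific pragmas.

```c
#include <assert.h>
#include <stddef.h>

/* Hints in the source order "unroll, then vectorize". */
void scale_a(float *dst, const float *src, size_t n, float f) {
  assert(dst != NULL && src != NULL);
#ifdef __clang__
#pragma clang loop unroll_count(2)
#pragma clang loop vectorize_width(4)
#endif
  for (size_t i = 0; i < n; i += 1)
    dst[i] = src[i] * f;
}

/* Same hints in the opposite source order. */
void scale_b(float *dst, const float *src, size_t n, float f) {
  assert(dst != NULL && src != NULL);
#ifdef __clang__
#pragma clang loop vectorize_width(4)
#pragma clang loop unroll_count(2)
#endif
  for (size_t i = 0; i < n; i += 1)
    dst[i] = src[i] * f;
}
```

Under the stacking rule proposed here, the two spellings would instead
request two different transformation sequences.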

2. Some transformations result in more than one loop or apply to more
than one loop. Tiling as in the introductory example does both. In
these cases, disambiguation is needed.
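
To illustrate why tiling needs this, here is a minimal hand-written
sketch of tiling a three-deep nest: it consumes three loops and
produces six, so any follow-up transformation has to say which of the
six it means. The names i0, j0, k0 and the helper functions are
illustrative only; they are not the pit_ids/tile_ids of the example,
and the remainder handling via a clamped upper bound is just one
possible choice.

```c
#include <assert.h>

enum { M = 7, N = 5, K = 6 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Untiled reference: three loops. The caller zeroes C beforehand. */
void matmul_ref(double C[M][N], double A[M][K], double B[K][N]) {
  for (int i = 0; i < M; i += 1)
    for (int j = 0; j < N; j += 1)
      for (int k = 0; k < K; k += 1)
        C[i][j] += A[i][k] * B[k][j];
}

/* Tiled by hand: three input loops become six output loops. */
void matmul_tiled(double C[M][N], double A[M][K], double B[K][N],
                  int ti, int tj, int tk) {
  assert(ti > 0 && tj > 0 && tk > 0);
  /* Outer three loops step from tile to tile. */
  for (int i0 = 0; i0 < M; i0 += ti)
    for (int j0 = 0; j0 < N; j0 += tj)
      for (int k0 = 0; k0 < K; k0 += tk)
        /* Inner three loops visit the elements of one tile. */
        for (int i = i0; i < min_int(i0 + ti, M); i += 1)
          for (int j = j0; j < min_int(j0 + tj, N); j += 1)
            for (int k = k0; k < min_int(k0 + tk, K); k += 1)
              C[i][j] += A[i][k] * B[k][j];
}

/* Returns 1 if tiling with the given sizes reproduces the reference. */
int tiling_preserves_result(int ti, int tj, int tk) {
  double A[M][K], B[K][N], C1[M][N] = {{0}}, C2[M][N] = {{0}};
  for (int i = 0; i < M; i += 1)
    for (int k = 0; k < K; k += 1)
      A[i][k] = i + 2 * k;
  for (int k = 0; k < K; k += 1)
    for (int j = 0; j < N; j += 1)
      B[k][j] = k - j;
  matmul_ref(C1, A, B);
  matmul_tiled(C2, A, B, ti, tj, tk);
  for (int i = 0; i < M; i += 1)
    for (int j = 0; j < N; j += 1)
      if (C1[i][j] != C2[i][j])
        return 0;
  return 1;
}
```

Calling tiling_preserves_result(3, 2, 4) exercises partial tiles at
the borders; with the sizes from the example, (131, 23, 21), the small
test matrices fit into a single tile.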

Our suggestion is the ability to assign identifiers to loops, as in:

    #pragma clang loop id(myloop)
    for (int i = 0; i < M; i+=1)
      ...

and refer to them in other transformations. To avoid boilerplate,
loops could be assigned implicit loop names derived from the loop
counter variable name, if it is unique within a function.

This also allows specifying the transformations separately from the
loop. This could be used, for instance, to apply platform-specific
optimizations:

    #ifdef __CUDACC__
    #pragma clang loop(i,j,k) tile sizes(131,23,21)
    #else
    ....

3. The current syntax for loop vectorization with user-guaranteed
safety and vectorization width of 16 is:

    #pragma clang loop vectorize(assume_safety) vectorize_width(16)

which could be confused with two consecutive vectorizations. Our
proposal is an alternative syntax that is more similar to OpenMP's
declaration followed by clauses style, which becomes even more
interesting for transformations that support more than 2 options:

    #pragma clang loop vectorize assume_safety width(16)

For compatibility, the legacy syntax can be supported for the
already existing #pragmas, including the pass-manager-defined
execution order. [4] proposes to add a #pragma clang loop
unroll_and_jam using the old syntax, which would require more
compatibility exceptions.
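
For reference, a minimal sketch of the legacy syntax in a complete
function. Both options are existing Clang loop hints; the function
name add_offset is hypothetical. With assume_safety the caller, not
the compiler, vouches that the iterations are independent, here that
out and in do not overlap in a way that creates a loop-carried
dependence. The #ifdef guard only hides the pragma from other
compilers.

```c
#include <assert.h>
#include <stddef.h>

/* The compiler cannot prove that out and in are distinct buffers;
   the hint tells it to vectorize without that proof. */
void add_offset(float *out, const float *in, size_t n, float offset) {
  assert(out != NULL && in != NULL);
#ifdef __clang__
#pragma clang loop vectorize(assume_safety) vectorize_width(16)
#endif
  for (size_t i = 0; i < n; i += 1)
    out[i] = in[i] + offset;
}
```

Under the proposed syntax the same request would be spelled as the
single directive "vectorize" with the clauses "assume_safety" and
"width(16)", as shown above.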

4. The usefulness of proposals 1 to 3 might be limited for the
transformations Clang currently supports (unrolling,
vectorization/interleaving, distribution), but becomes more important
with additional transformations. There is already a LoopInterchange
pass, for which a #pragma clang loop interchange could be added.
Unroll-and-jam will be added by [4]. [1] presents our ideas for more
transformations.

5. I am currently working on a prototype at [2]. Here is an outline of
the implementation so far.

Instead of taking the path that "#pragma omp simd" takes (using
captured statements), it follows the more lightweight path of the
other #pragma clang loop annotations (using attributes), passing them
through Clang's layers.

* Preprocessor:
For the legacy syntax ("vectorize", "interleave", "distribute" or
"unroll" followed by an opening parenthesis), use the old code.
Otherwise, push a tok::annot_pragma_loop_transform (instead of
tok::annot_loop_hint) token to the preprocessor stack.

* Parser:
Instead of a general LoopHintAttr, each transformation has its own
attribute. #pragmas that refer to a loop name instead of the following
loop are forwarded to Sema using a dedicated ActOn function.

* Sema:
Transformations using a loop name annotate the function, not the following loop.

* Codegen:
I will post an RFC about the changed metadata format and how the IR is
transformed to the llvm-dev mailing list.

6. We also intend to propose such loop transformations to the OpenMP
standard [1]. An implementation in Clang could serve as a
proof-of-concept. Even if such loop transformations do not make it
into OpenMP, an implementation in Clang would be useful for users of
Clang.

Michael

[1] https://arxiv.org/abs/1805.03374
[2] https://github.com/Meinersbur/clang/tree/pragma
[4] https://reviews.llvm.org/D47267
[5] http://llvm.org/docs/Vectorizers.html
[6] https://clang.llvm.org/docs/LanguageExtensions.html#id21