[Openmp-commits] [clang] [llvm] [mlir] [openmp] [LoopTiling][Clang][MLIR] Canonical Intra-tile Loops (PR #191114)

Tue May 26 08:59:35 PDT 2026

https://github.com/Meinersbur commented:

Please create separate PRs for Clang and MLIR. These are separate implementations don't need to be reviewed together. I am focusing on Clang in this PR.

None of the added test are actually requiring the intratile loops to be canonical. That is, something where an surrounding loop-associated directives affects a grid/floor loop as well as a intratile loop at the same time. For instance:
```cpp
#pragma omp for collapse(2)
#pragma omp tile size(2)
for (int i = 0; i < n; ++i) 
   ...
```
or two-level tiling:
```cpp
#pragma omp tile sizes(3, 5)
#pragma omp tile size(2)
for (int i = 0; i < n; ++i) 
   ...
```

`IntraTileMustBeCanonical` is an ok-ish first heuristic, a refinement would be to which of the tile loops are actually affected and thus would need to be canonical. For instance:
```cpp
#pragma omp for collapse(3)
#pragma omp tile size(2, 3)
for (int i = 0; i < n; ++i) 
  for (int j = 0; j < n; ++j) 
   ...
```
here, the intratile-loop for `j` does not need to be canonical, but `i` does. Because as you found out emitting the canonical version of the loop inhibits vectorization, it would be quite important to eventually have this.

Because Clang implements loop-transformations using AST transforms, I think it would be easier and more accurate if we always emit the canonical form, but emit a marker that allows to optimize the canonical form to the non-canonical one later, as suggested [here](https://github.com/llvm/llvm-project/pull/191114#issuecomment-4313291974). For instance, it could also emit an alternatve AST depending on the situation we are in, like `OMPCanonicalLoopNestTransformationDirective::getTransformed()` does. Imagine such an AST:
```
OMPLoopAlternative
`- getCanonical(): ForStmt // Predicated version when canonical version is needed
   `- for (int i = 0; i < n; ++)
      `- if (i < .floor.iv+tilesize)
`- getOptimized(): ForStmt // Vectorizable version when not used as an affected loop
   `- for (int i = 0; i < min(n, .floor.iv+tilesize; ++)
```
Storing both versions at the same time might not be the best solution due to the overhead of storing both. Instance, it could just add a marker (such as `AttributedStmt`) to the loop which says that the ForStmt can be emitted as the optimized one at CodeGen, i.e.:
```
for (int i = 0; i < n; ++)
`- AttributedStmt __attribute__((__invariant_pred_bound)) // States that the RHS is invariant to the loop using the LHS as IV
    `- if (i < .floor.iv+tilesize)
        `- ...
```
is emitted as 
```
for (int i = 0; i < min(n,  .floor.iv+tilesize); ++)
`- ...
```

I think both approaches are viable, but we don't need to start with a `IntraTileMustBeCanonical` if eventually we are going to do the second. Which approach would you prefer?

When I try this out, I get 3 failing tests:
```
Failed Tests (3):
  libomp :: transform/tile/foreach.cpp
  libomp :: transform/tile/intfor.c
  libomp :: transform/tile/iterfor.cpp
```
Test updates in this PR are adding more iterator evaluations, which assume is because mostly because fewer invariant values are precomputed? Ideally, If `IntraTileMustBeCanonical` is false it generates the same code. This should actually be easier to do with an `IntraTileMustBeCanonical` heursitic than with the marker approach.

https://github.com/llvm/llvm-project/pull/191114