[Mlir-commits] [mlir] [MLIR][SCF] Add canonicalization pattern to fold away iter args of scf.forall (PR #90189)

Matthias Springer llvmlistbot at llvm.org
Fri May 3 03:12:04 PDT 2024


================
@@ -1509,6 +1510,203 @@ class ForallOpControlOperandsFolder : public OpRewritePattern<ForallOp> {
   }
 };
 
+/// The following canonicalization pattern folds the iter arguments of an
+/// scf.forall op if:
+/// 1. The corresponding result has zero uses.
+/// 2. The iter argument is NOT being modified within the loop body.
+///
+/// Example of first case :-
+///  INPUT:
+///   %res:3 = scf.forall ... shared_outs(%arg0 = %a, %arg1 = %b, %arg2 = %c)
+///            {
+///                ...
+///                <SOME USE OF %arg0>
----------------
matthias-springer wrote:

> So, my understanding is that "each such op" will have shared_out block argument and it is always supposed to be in the "Destination" operand.

That is correct. If there is no op in the terminator that uses a certain bbarg as a destination, then that bbarg can be removed. You can think of the bbarg as the buffer that is being modified in parallel. If that buffer does not appear in the terminator, then we are not modifying the buffer at all; sure, we may have other ops inside the loop body that insert into the buffer, but if that tensor value is not then fed into a parallel_insert_slice, these insertions never become visible to the other threads. That's why it's safe to use the init arg instead of the bbarg in the loop body. The bufferization framework will privatize the buffer for each thread.
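To make that concrete, here is a minimal IR sketch (value names, shapes, and the `"some_compute"` op are invented for illustration; the placeholder op only parses with `--allow-unregistered-dialect`): `%arg1` is read and even updated inside the body, but never appears as a destination in the terminator, so its update never becomes visible and the pattern may replace `%arg1` with `%b` and drop the corresponding unused result.

```mlir
%res:2 = scf.forall (%i) = (0) to (128) step (16)
    shared_outs(%arg0 = %a, %arg1 = %b) -> (tensor<128xf32>, tensor<128xf32>) {
  %in = tensor.extract_slice %arg1[%i] [16] [1]
      : tensor<128xf32> to tensor<16xf32>
  %t = "some_compute"(%in) : (tensor<16xf32>) -> tensor<16xf32>
  // This insertion stays thread-local: its result is not fed into the
  // terminator, so it never becomes visible to other threads.
  %local = tensor.insert_slice %t into %arg1[%i] [16] [1]
      : tensor<16xf32> into tensor<128xf32>
  scf.forall.in_parallel {
    // Only %arg0 is used as a destination here, so only %arg0 must be kept.
    tensor.parallel_insert_slice %t into %arg0[%i] [16] [1]
        : tensor<16xf32> into tensor<128xf32>
  }
}
```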

> I don't think we should constrain the use of the iter_args to be ONLY that. Ideally the use would be that a slice of iter_arg is being extracted, performed some computation on and then stored back into the same iter_arg.

iter_args are typically also used in the loop body. That doesn't matter for your canonicalization pattern: you only need to check whether there is a use in the parallel_insert_slice. Any other uses of the iter_arg are irrelevant.
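For contrast, a sketch of the common case (again with invented names and the `"some_compute"` placeholder op): `%arg0` is used both inside the body and as a destination in the terminator. The body use plays no role in the check; the parallel_insert_slice destination use is the only thing that needs to be inspected, and it is what keeps `%arg0` from being folded away.

```mlir
%res = scf.forall (%i) = (0) to (128) step (16)
    shared_outs(%arg0 = %a) -> (tensor<128xf32>) {
  // A use of the iter_arg in the loop body: irrelevant for the check.
  %in = tensor.extract_slice %arg0[%i] [16] [1]
      : tensor<128xf32> to tensor<16xf32>
  %t = "some_compute"(%in) : (tensor<16xf32>) -> tensor<16xf32>
  scf.forall.in_parallel {
    // The use that matters: %arg0 is a parallel_insert_slice destination,
    // so this iter_arg must not be removed.
    tensor.parallel_insert_slice %t into %arg0[%i] [16] [1]
        : tensor<16xf32> into tensor<128xf32>
  }
}
```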

> I still don't think the semantics of the operation is well defined when a shared-outs is used anywhere outside of the scf.forall.in_parallel region.

Are you talking about the init value (operand of the scf.forall) or the iter_arg (bbarg of the region)? If you mean the former, you are right, such IR usually does not make sense. It is supported (it will not misbufferize), but a copy will be inserted. The latter (the bbarg) is expected to appear in the loop body. The bbarg models the "future buffer of the tensor that is being updated in parallel". The terminator (parallel_insert_slice) just makes sure that the "updated" tensor is written back. It is kind of like the "extract_slice, computation, insert_slice" pattern after tiling: the insert_slice doesn't really do anything; it is just needed so that the computation result has a use and does not fold away.
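For comparison, here is a sketch of that post-tiling structure with a plain scf.for (names, shapes, and the `"some_compute"` placeholder op are invented): the insert_slice writes the tile back into the iteration argument, which is what keeps the computation live, just like the parallel_insert_slice in the scf.forall terminator.

```mlir
%c0 = arith.constant 0 : index
%c16 = arith.constant 16 : index
%c128 = arith.constant 128 : index
%res = scf.for %i = %c0 to %c128 step %c16
    iter_args(%acc = %init) -> (tensor<128xf32>) {
  %tile = tensor.extract_slice %acc[%i] [16] [1]
      : tensor<128xf32> to tensor<16xf32>
  %comp = "some_compute"(%tile) : (tensor<16xf32>) -> tensor<16xf32>
  // The insert_slice gives the computation a use and writes the updated
  // tile back into the iteration argument.
  %upd = tensor.insert_slice %comp into %acc[%i] [16] [1]
      : tensor<16xf32> into tensor<128xf32>
  scf.yield %upd : tensor<128xf32>
}
```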


https://github.com/llvm/llvm-project/pull/90189

