[all-commits] [llvm/llvm-project] 3bb969: [flang] Inline hlfir.matmul[_transpose]. (#122821)

Wed Jan 15 08:43:18 PST 2025

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 3bb969f3ebb25037e8eb69c30a5a0dfb5d9d0f51
      https://github.com/llvm/llvm-project/commit/3bb969f3ebb25037e8eb69c30a5a0dfb5d9d0f51
  Author: Slava Zakharin <szakharin at nvidia.com>
  Date:   2025-01-15 (Wed, 15 Jan 2025)

  Changed paths:
    M flang/include/flang/Optimizer/Builder/FIRBuilder.h
    M flang/include/flang/Optimizer/Builder/HLFIRTools.h
    M flang/include/flang/Optimizer/HLFIR/Passes.td
    M flang/lib/Optimizer/Builder/FIRBuilder.cpp
    M flang/lib/Optimizer/Builder/HLFIRTools.cpp
    M flang/lib/Optimizer/HLFIR/Transforms/SimplifyHLFIRIntrinsics.cpp
    M flang/lib/Optimizer/Passes/Pipelines.cpp
    M flang/test/Driver/mlir-pass-pipeline.f90
    M flang/test/Fir/basic-program.fir
    A flang/test/HLFIR/simplify-hlfir-intrinsics-matmul.fir

  Log Message:
  -----------
  [flang] Inline hlfir.matmul[_transpose]. (#122821)

Inlining `hlfir.matmul` as `hlfir.eval_in_mem` does not get rid of
the temporary array in many cases, but it may still be much better,
since it makes it possible to:
  * Get rid of any overhead related to calling the runtime MATMUL
    (such as descriptor creation).
  * Use a CPU-specific vectorization cost model for matmul loops,
    which the Fortran runtime currently cannot do.
  * Optimize matmul of known-size arrays by complete unrolling.
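The loop nest below is a rough illustration of what "inlined" means here, written in Python rather than MLIR. It is a minimal sketch, assuming column-major flat storage, and `matmul_in_mem` is a hypothetical name; it is not the code the pass generates. The result is written into a preallocated buffer (analogous to the memory `hlfir.eval_in_mem` exposes), and no runtime call or descriptor is involved.

```python
def matmul_in_mem(a, b, m, k, n):
    """Hypothetical sketch: C = A @ B with column-major flat lists.

    A is m x k, B is k x n, and the result C is m x n.
    """
    # Preallocated result buffer, standing in for the eval_in_mem memory.
    c = [0.0] * (m * n)
    for j in range(n):
        for kk in range(k):
            bkj = b[kk + j * k]
            for i in range(m):
                # Innermost loop walks down a column of A and C (unit stride),
                # the kind of loop a vectorizer can target.
                c[i + j * m] += a[i + kk * m] * bkj
    return c
```

When `m`, `k` and `n` are compile-time constants, a compiler is free to unroll such a nest completely, which is the third point above.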

One of the drawbacks of `hlfir.eval_in_mem` inlining is that
the ops inside it that have store memory effects block the current
MLIR CSE, so I decided to run this inlining late in the pipeline.
A source comment explains the CSE issue in more detail.

Straightforward inlining of `hlfir.matmul` as an `hlfir.elemental`
is not good for performance: I got performance regressions with it
compared to the Fortran runtime implementation. I put it under an
engineering option for experiments.
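For contrast, an elemental formulation computes each result element independently as a dot product. The commit does not say why this regressed; one plausible factor (an assumption, not stated above) is that the inner reduction reads a row of the left operand, which has stride `m` in column-major storage. This is a minimal Python sketch with a hypothetical name, not the HLFIR the option produces.

```python
def matmul_elemental(a, b, m, k, n):
    """Hypothetical sketch: each element of C = A @ B computed on its own."""
    def element(i, j):
        # Row i of A is strided by m in column-major storage;
        # column j of B is contiguous.
        return sum(a[i + kk * m] * b[kk + j * k] for kk in range(k))
    # j outer, i inner, so the result list is in column-major order.
    return [element(i, j) for j in range(n) for i in range(m)]
```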

At the same time, inlining `hlfir.matmul_transpose` as `hlfir.elemental`
seems to be a good approach; for example, it allows getting rid of a temporary
array in cases like `A(:)=B(:)+MATMUL(TRANSPOSE(C(:,:)),D(:))`.
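To see why the example above needs no temporary, note that element `i` of `MATMUL(TRANSPOSE(C),D)` is the dot product of column `i` of `C` with `D`, so it can be produced on demand inside the loop that performs the addition. The sketch below assumes column-major flat lists and uses a hypothetical function name; it illustrates the fusion idea rather than the generated code.

```python
def add_matmul_transpose(b, c, d, k, m):
    """Hypothetical sketch: A(:) = B(:) + MATMUL(TRANSPOSE(C), D).

    C is k x m in column-major order, D has length k, B has length m.
    """
    a = [0.0] * m
    for i in range(m):
        # Column i of C is contiguous: c[i*k : (i+1)*k].
        dot = sum(c[kk + i * k] * d[kk] for kk in range(k))
        # The addition is fused into the same loop, so no temporary
        # vector ever holds the full MATMUL result.
        a[i] = b[i] + dot
    return a
```

Unlike the plain `hlfir.matmul` case, the reduction here walks contiguous memory in `C`, which is consistent with the elemental form working well for the transposed variant.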

This patch slightly improves the performance of galgel and tonto.
