[all-commits] [llvm/llvm-project] 5417a5: [mlir][ArmSME] Add rudimentary support for tile sp...

Fri Jan 12 06:51:59 PST 2024

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 5417a5fed6e1e026fa040de2e83872fa9fa1a443
      https://github.com/llvm/llvm-project/commit/5417a5fed6e1e026fa040de2e83872fa9fa1a443
  Author: Benjamin Maxwell <benjamin.maxwell at arm.com>
  Date:   2024-01-12 (Fri, 12 Jan 2024)

  Changed paths:
    M mlir/include/mlir/Dialect/ArmSME/IR/ArmSME.h
    M mlir/include/mlir/Dialect/ArmSME/IR/ArmSMEOps.td
    M mlir/lib/Conversion/ArmSMEToLLVM/ArmSMEToLLVM.cpp
    M mlir/lib/Dialect/ArmSME/IR/Utils.cpp
    M mlir/lib/Dialect/ArmSME/Transforms/TileAllocation.cpp
    A mlir/test/Conversion/ArmSMEToLLVM/tile-spills-and-fills.mlir
    M mlir/test/Dialect/ArmSME/tile-allocation.mlir
    A mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/use-too-many-tiles.mlir

  Log Message:
  -----------
  [mlir][ArmSME] Add rudimentary support for tile spills to the stack (#76086)

This adds very basic (and inelegant) support for something like spilling
and reloading tiles, if you use more SME tiles than physically exist.

This is purely implemented to prevent the compiler from aborting if a
function uses too many tiles (i.e. due to bad unrolling), but is
expected to perform very poorly.

Currently, this works in two stages:

During tile allocation, if we run out of tiles instead of giving up, we
switch to allocating 'in-memory' tile IDs. These are tile IDs that start
at 16 (which is higher than any real tile ID). A warning will also be
emitted for each (root) tile op assigned an in-memory tile ID:

```
warning: failed to allocate SME virtual tile to operation, all tile operations will go through memory, expect degraded performance
```

Everything after this works like normal until `-convert-arm-sme-to-llvm`

Here the in-memory tile op:

```mlir
arm_sme.tile_op { tile_id = <IN MEMORY TILE> }
```

Is lowered to:

```mlir
// At function entry:
%alloca = memref.alloca ... : memref<?x?xty>

// Around the op:
// Swap the contents of %alloca and tile 0.
scf.for %slice_idx {
  %current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
  "arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx)  <{tile_id = 0 : i32}>
  vector.store %current_slice, %alloca[%slice_idx, %c0]
}
// Execute op using tile 0.
arm_sme.tile_op { tile_id = 0 }
// Swap the contents of %alloca and tile 0.
// This restores tile 0 to its original state.
scf.for %slice_idx {
  %current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
  "arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx)  <{tile_id = 0 : i32}>
  vector.store %current_slice, %alloca[%slice_idx, %c0]
}
```

This is inserted during the lowering to LLVM as spilling/reloading
registers is a very low-level concept, that can't really be modeled
correctly at a high level in MLIR.

Note: This is always doing the worst case full-tile swap. This could be
optimized to only spill/load data the tile op will use, which could be
just a slice. It's also not making any use of liveness, which could
allow reusing tiles. But these is not seen as important as correct code
should only use the available number of tiles.