[Mlir-commits] [mlir] [mlir][ArmSME] Lower vector.outerproduct to FMOPA/BFMOPA (PR #65621)

Fri Sep 8 01:53:19 PDT 2023

================
@@ -0,0 +1,93 @@
+// DEFINE: %{entry_point} = test_outerproduct_4x4xf32
+// DEFINE: %{compile} = mlir-opt %s \
+// DEFINE:   -enable-arm-streaming="mode=locally enable-za" \
+// DEFINE:   -convert-vector-to-arm-sme -convert-arm-sme-to-scf \
+// DEFINE:   -convert-vector-to-llvm="enable-arm-sme" -cse -canonicalize \
+// DEFINE:   -allocate-arm-sme-tiles -test-lower-to-llvm
+// DEFINE: %{run} = %mcr_aarch64_cmd \
+// DEFINE:   -march=aarch64 -mattr=+sve,+sme \
+// DEFINE:   -e %{entry_point} -entry-point-result=void \
+// DEFINE:   -shared-libs=%mlir_runner_utils,%mlir_c_runner_utils
+
+// RUN: %{compile} | %{run} | FileCheck %s
+
+// REDEFINE: %{entry_point} = test_outerproduct_no_accumulator_4x4xf32
+// RUN: %{compile} | %{run} | FileCheck %s --check-prefix=CHECK-NO-ACC
+
+func.func @test_outerproduct_4x4xf32() {
+  %c0 = arith.constant 0 : index
+  %f1 = arith.constant 1.0 : f32
+  %f2 = arith.constant 2.0 : f32
+  %f10 = arith.constant 10.0 : f32
+
+  %a = vector.splat %f1 : vector<[4]xf32>
+  %b = vector.splat %f2 : vector<[4]xf32>
+  // TODO: vector.splat doesn't support ArmSME.
+  %c = vector.broadcast %f10 : f32 to vector<[4]x[4]xf32>
+
+  %tile = vector.outerproduct %a, %b, %c : vector<[4]xf32>, vector<[4]xf32>
+
+  // Calculate the size of a 32-bit tile, e.g. ZA{n}.s.
+  %vscale = vector.vscale
+  %min_elts_s = arith.constant 4 : index
+  %svl_s = arith.muli %min_elts_s, %vscale : index
+  %za_s_size = arith.muli %svl_s, %svl_s : index
+
+  // Allocate memory.
+  %mem = memref.alloca(%za_s_size) : memref<?xf32>
+
+  // Store the tile to memory.
+  vector.store %tile, %mem[%c0] : memref<?xf32>, vector<[4]x[4]xf32>
+
+  // Reload and print. The smallest SVL is 128 bits, so the tile will be at
+  // least 4x4xf32.
+  //
+  // CHECK:      ( 12, 12, 12, 12
+  // CHECK-NEXT: ( 12, 12, 12, 12
+  // CHECK-NEXT: ( 12, 12, 12, 12
+  // CHECK-NEXT: ( 12, 12, 12, 12
----------------
c-rhodes wrote:

I copied the example from `mlir/test/Integration/Dialect/Vector/CPU/test-outerproduct-f32.mlir`, but I agree that what you suggested is better; done. I've left the f64 test as is, though, partly for a bit of variation and partly because its tile is small enough that stepvector doesn't produce particularly interesting output.
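For reference, here is a worked check of the values in the hunk above. With an accumulator operand, `vector.outerproduct` (default `add` kind) computes `acc + lhs ⊗ rhs`. Given `%a` = splat 1.0, `%b` = splat 2.0 and `%c` = broadcast 10.0, every element of `%tile` is 12. The buffer size follows from the `%svl_s`/`%za_s_size` computation. `SVL_s` and `|ZA_n.s|` are just labels for those two SSA values. At the minimum 128-bit SVL (vscale = 1), the buffer holds 4 × 4 f32 elements:

```latex
\begin{aligned}
\mathrm{tile}_{ij} &= c_{ij} + a_i\, b_j = 10.0 + 1.0 \times 2.0 = 12.0,
  \qquad 0 \le i, j < \mathrm{SVL}_s \\
\mathrm{SVL}_s &= 4 \cdot \mathrm{vscale}, \qquad
  |\mathrm{ZA}_n.s| = \mathrm{SVL}_s^2 \\
\mathrm{vscale} = 1 \;(\text{128-bit SVL}) &\;\Rightarrow\;
  \mathrm{SVL}_s = 4, \quad |\mathrm{ZA}_n.s| = 16 \text{ f32 elements}
\end{aligned}
```

This is why the CHECK lines only pin down the first four values of each row. On hardware with a larger SVL the rows are longer, but every element is still 12.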

https://github.com/llvm/llvm-project/pull/65621