[Mlir-commits] [mlir] [mlir][linalg] Fix for bias handling for Winograd (PR #110331)
llvmlistbot at llvm.org
Mon Sep 30 13:37:38 PDT 2024
================
@@ -837,9 +837,25 @@ Value outputTransform(RewriterBase &rewriter, Location loc, Value value,
Value widthOffset =
builder.create<affine::AffineApplyOp>(loc, affineMap, tileWIter);
+ // Handling bias.
+ Value prevVal =
+ extract2DDataFrom4D(builder, loc, args[0], NIter, FIter, heightOffset,
+ widthOffset, retRows, retCols,
+ /*loopNorFIdx=*/0,
+ /*loopCorFIdx=*/3, /*heightIdx=*/1,
+ /*widthIdx=*/2);
+ Value biasedVal =
+ builder
+ .create<linalg::AddOp>(
+ loc, prevVal.getType(), ValueRange{matmulRetValue, prevVal},
+ ValueRange{builder.create<tensor::EmptyOp>(
+ loc, llvm::cast<ShapedType>(prevVal.getType()).getShape(),
+ elementType)})
+ .getResult(0);
+
----------------
Max191 wrote:
It would be good not to generate lots of extra ops when possible. I see that having the `scalarFactor`s prevents using the init slice as the init value for the last matmul, but when there is no `scalarFactor`, it would be good to use the init slice directly as the out argument for the last matmul; see the sketch below.
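A minimal sketch of that, assuming no `scalarFactor` and meant to slot into the same function as the diff above; `lhs`/`rhs` are hypothetical stand-ins for the actual transform-matrix and tile operands, while `prevVal`, `builder`, and `loc` come from the diff:
```cpp
// Sketch only: pass the extracted init slice `prevVal` as the outs
// operand so the bias add becomes the matmul's accumulation, instead of
// creating a tensor.empty init plus a separate linalg.add.
Value matmulRetValue =
    builder
        .create<linalg::MatmulOp>(loc, prevVal.getType(),
                                  /*inputs=*/ValueRange{lhs, rhs},
                                  /*outputs=*/ValueRange{prevVal})
        .getResult(0);
```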
Also, in the case where there is a `scalarFactor`, the broadcast + mul + add could all be combined into a single linalg.generic. Something like:
```
%res = linalg.generic {
indexing_maps = [
affine_map<(d0, d1) -> (d0, d1)>,
affine_map<(d0, d1) -> ()>,
affine_map<(d0, d1) -> (d0, d1)>],
iterator_types = ["parallel", "parallel"]}
ins(%a, %b : tensor<...>, f32) outs(%init_slice : tensor<...>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
  %5 = arith.mulf %in, %in_0 : f32
  %6 = arith.addf %5, %out : f32
  linalg.yield %6 : f32
} -> tensor<...>
```
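And a minimal sketch of how that fused generic might be emitted with the C++ builder API, again assuming it slots into the same function as the diff; `scalarFactor` is assumed to be an `f32` Value as discussed, and `prevVal`, `matmulRetValue`, `builder`, and `loc` come from the diff:
```cpp
// Sketch only: fuse broadcast(scalarFactor) * matmulRetValue + init slice
// into one linalg.generic, accumulating into the extracted slice
// `prevVal` rather than a fresh tensor.empty.
Type resultType = prevVal.getType();
int64_t rank = llvm::cast<ShapedType>(resultType).getRank();
SmallVector<AffineMap> indexingMaps = {
    builder.getMultiDimIdentityMap(rank),           // matmulRetValue
    AffineMap::get(rank, 0, builder.getContext()),  // scalarFactor: () map
    builder.getMultiDimIdentityMap(rank)};          // init slice (outs)
SmallVector<utils::IteratorType> iteratorTypes(rank,
                                               utils::IteratorType::parallel);
Value biasedVal =
    builder
        .create<linalg::GenericOp>(
            loc, resultType,
            /*inputs=*/ValueRange{matmulRetValue, scalarFactor},
            /*outputs=*/ValueRange{prevVal}, indexingMaps, iteratorTypes,
            [](OpBuilder &b, Location nestedLoc, ValueRange args) {
              Value mul = b.create<arith::MulFOp>(nestedLoc, args[0], args[1]);
              Value add = b.create<arith::AddFOp>(nestedLoc, mul, args[2]);
              b.create<linalg::YieldOp>(nestedLoc, add);
            })
        .getResult(0);
```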
https://github.com/llvm/llvm-project/pull/110331
More information about the Mlir-commits mailing list