[PATCH] D72129: [mlir] Add in-dialect lowering of gpu.all_reduce.

Mon Jan 6 02:54:11 PST 2020

ftynse marked an inline comment as done.
ftynse added a comment.

Wrong clang-tidy checks are annoying here, will make another pass later.

================
Comment at: mlir/include/mlir/Dialect/GPU/GPUOps.td:163
+      auto attr = getAttrOfType<IntegerAttr>(attrName);
+      setAttr(attrName, IntegerAttr::get(attr.getType(), attr.getValue()+1));
+      return getBody().front().insertArgument(
----------------
Style nit: add whitespace around "+"

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:29
+
+  explicit GpuAllReduceRewriter(gpu::GPUFuncOp funcOp_,
+                                gpu::AllReduceOp reduceOp_,
----------------
Nit: `explicit` is unnecessary for multi-argument constructors.

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:39
+  ///
+  /// First reduce the elements within a warp. The first thread of each warp
+  /// writes the intermediate result to shared memory. After synchronizing the
----------------
Nit: can we use "neutral" terminology: shared memory -> workgroup memory, block -> workgroup ?

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:47
+  ///   ^then1:
+  ///     store %warp_reduce, %workgroup_buffer, %warp_id
+  ///     br ^continue1
----------------
Should this be `store %warp_reduce, %workgroup_buffer[%warp_id]`?
Same below for loads and stores

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:52
+  ///     %is_valid_warp = cmpi "slt" %thread_idx, %num_warps
+  ///     cond_br %is_first_lane, ^then2, ^continue2
+  ///   ^then2:
----------------
Should this be `cond_br %is_valid_warp` instead?

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:67
+    // Compute linear thread index and block size.
+    Value *dimX = getDimOp<gpu::BlockDimOp>("x");
+    Value *dimY = getDimOp<gpu::BlockDimOp>("y");
----------------
Big and mechanical change: `Value` is now value-based, please use it without pointers.

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:81
+    // Compute lane id (invocation id withing the subgroup).
+    Value *warpMask = create<ConstantIntOp>(loc, kWarpSize - 1, 32);
+    Value *laneId = create<AndOp>(loc, threadIdx, warpMask);
----------------
Style nit: add a comment `/*bitwidth=*/` for `32`

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:88
+    // The number of active threads in the warp, not clamped to 32.
+    Value *activeWidth =
+        create<SubIOp>(loc, blockSize, numThreadsWithSmallerWarpId);
----------------
Not sure I follow here. `numThreadsWithSmallerWarpId` is the equivalent of `warp_id * warp_size`, or `floor(linear_thread_id, warp_size) * warp_size`. Subtracting that from `blockSize` will give you the number of threads in all warps starting from the current warp. From skimming through the consumer (createWarpReduce), it looks like what it expects is the number of threads in the _current_ warp.

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:104
+    // each warp.
+    createPredicatedBlock(isFirstLane, [&] {
+      Value *warpId = getDivideByWarpSize(threadIdx);
----------------
This looks like you want EDSC :)

+ @nicolasvasilache 

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:138
+  template <typename T, typename... Args> T create(Args... args) {
+    return rewriter.create<T>(std::forward<Args>(args)...);
+  }
----------------
Maybe you can store the location and pass it as first argument? It's the same for all operations you create above.

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:150
+    auto bufferType =
+        MemRefType::get({kWarpSize}, valueType, ArrayRef<AffineMap>{}, 3);
+    return funcOp.addWorkgroupAttribution(bufferType);
----------------
Nit: let's have a named constant for the address space 

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:201
+  AccumulatorFactory getFactory(StringRef opName) {
+    bool isFloatingPoint = valueType.isF32();
+    if (opName == "add")
----------------
Why only `F32`? Could you have `isa<FloatType>()` instead?

================
Comment at: mlir/lib/Dialect/GPU/Transforms/AllReduceLowering.cpp:243
+
+    auto addBranch = [&](ValueRange operands) {
+      create<BranchOp>(loc, continueBlock, operands);
----------------
Is this lambda worse it? 

================
Comment at: mlir/test/Dialect/GPU/all-reduce.mlir:7
+  gpu.func @kernel(%arg0 : f32) attributes { gpu.kernel } {
+    // CHECK:   %[[vc0_i32:[a-z_0-9]+]] = constant 0 : i32
+    // CHECK:   %[[vc31_i32:[a-z_0-9]+]] = constant 31 : i32
----------------
ftynse wrote:
> csigg wrote:
> > These CHECKs were generated from the output with:
> > 
> > ```
> > sed -r \
> > -e 's|\t+//.*||' \
> > -e 's|%([a-z_0-9]+) = |%[[v\1:[a-z_0-9]+]] = |' \
> > -e 's|\(%([a-z_0-9]+): ([a-z_0-9]+)\):|(%[[v\1:[a-z_0-9]+]]: \2):|' \
> > -e 's|%([a-z_0-9]+)|%[[v\1]]|g' \
> > -e 's|bb([0-9]+)|bb[[#b\1]]|g' \
> > -e 's|^ |    // CHECK:|'
> > ```
> > 
> > and manaul edits from there.
> > 
> We have https://github.com/llvm/llvm-project/blob/master/mlir/utils/generate-test-checks.py, have you tried it?
Is there a reason why you can't use `.*` as a regexp for names? Seeing the full with ranges makes it hard to read the test. I think we also support capital letters in the names.

================
Comment at: mlir/test/Dialect/GPU/all-reduce.mlir:7-176
+    // CHECK:   %[[vc0_i32:[a-z_0-9]+]] = constant 0 : i32
+    // CHECK:   %[[vc31_i32:[a-z_0-9]+]] = constant 31 : i32
+    // CHECK:   %[[vc0:[a-z_0-9]+]] = constant 0 : index
+    // CHECK:   %[[vc32_i32:[a-z_0-9]+]] = constant 32 : i32
+    // CHECK:   %[[vc1_i32:[a-z_0-9]+]] = constant 1 : i32
+    // CHECK:   %[[vc2_i32:[a-z_0-9]+]] = constant 2 : i32
+    // CHECK:   %[[vc4_i32:[a-z_0-9]+]] = constant 4 : i32
----------------
csigg wrote:
> These CHECKs were generated from the output with:
> 
> ```
> sed -r \
> -e 's|\t+//.*||' \
> -e 's|%([a-z_0-9]+) = |%[[v\1:[a-z_0-9]+]] = |' \
> -e 's|\(%([a-z_0-9]+): ([a-z_0-9]+)\):|(%[[v\1:[a-z_0-9]+]]: \2):|' \
> -e 's|%([a-z_0-9]+)|%[[v\1]]|g' \
> -e 's|bb([0-9]+)|bb[[#b\1]]|g' \
> -e 's|^ |    // CHECK:|'
> ```
> 
> and manaul edits from there.
> 
We have https://github.com/llvm/llvm-project/blob/master/mlir/utils/generate-test-checks.py, have you tried it?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D72129/new/

https://reviews.llvm.org/D72129