[Mlir-commits] [mlir] [MLIR][XeGPU] Scattered ops sg-to-wi distribution (PR #154949)

Charitha Saumya llvmlistbot at llvm.org
Mon Aug 25 09:55:28 PDT 2025


charithaintc wrote:

I did some quick testing. My conclusion is that we don't have to care about how the offset is defined; the framework takes care of it (unless the offset is produced by some op that is not yet supported, in which case we need to add support for that op).

Example 1 (trivially distributable):
```
func.func @lane_dependent_warp_propagate_read(
    %src: memref<1024xf32>, %dest: memref<1024xf32>) {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %laneid = gpu.lane_id
  %r = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
    %2 = arith.constant dense<0.0> : vector<32xf32>
    gpu.yield %2 : vector<32xf32>
  }
  vector.transfer_write %r, %dest[%laneid] : vector<1xf32>, memref<1024xf32>
  return
}
```
To
```
  func.func @lane_dependent_warp_propagate_read(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
    %cst = arith.constant dense<0.000000e+00> : vector<1xf32>
    %0 = gpu.lane_id
    vector.transfer_write %cst, %arg1[%0] : vector<1xf32>, memref<1024xf32>
    return
  }
```
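The distribution step itself can be modeled outside MLIR. Below is a minimal Python sketch (all names are illustrative, not part of the framework) of how a warp-level `vector<32xf32>` yielded by the `warp_execute_on_lane_0` region is split so that each of the 32 lanes receives a `vector<1xf32>`; for a splat constant every lane sees the same value, which is why the constant can be hoisted out of the warp region as in the output above.

```python
# Model of warp_execute_on_lane_0 distribution: a warp-level
# vector<32xf32> is split across 32 lanes, each lane receiving
# the vector<1xf32> slice at position lane_id.
WARP_SIZE = 32

def distribute(warp_vector, lane_id):
    """Return the single-element slice owned by `lane_id`."""
    assert len(warp_vector) == WARP_SIZE
    return warp_vector[lane_id:lane_id + 1]

# A splat constant distributes trivially: every lane gets the
# same value, so the hoisted per-lane constant is well defined.
splat = [0.0] * WARP_SIZE
assert all(distribute(splat, lane) == [0.0] for lane in range(WARP_SIZE))
```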

Example 2 (complicated case):
```
func.func @lane_dependent_warp_propagate_read(
    %src: memref<1024xf32>, %dest: memref<1024xf32>) {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %laneid = gpu.lane_id
  %r = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
    // %2 = arith.constant dense<0.0> : vector<32xf32>
    %2 = arith.constant dense<[0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0]> : vector<32xf32>
    gpu.yield %2 : vector<32xf32>
  }
  vector.transfer_write %r, %dest[%laneid] : vector<1xf32>, memref<1024xf32>
  return
}
```
To
```
  func.func @lane_dependent_warp_propagate_read(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
    %cst = arith.constant dense<[0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00]> : vector<32xf32>
    %0 = gpu.lane_id
    %1 = gpu.warp_execute_on_lane_0(%0)[32] -> (vector<1xf32>) {
      gpu.yield %cst : vector<32xf32>
    }
    vector.transfer_write %1, %arg1[%0] : vector<1xf32>, memref<1024xf32>
    return
  }
```

I agree that for the complex case a broadcast is indeed needed. But I think that is outside the scope of gather/scatter distribution; those patterns should not have to care about it.
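To make the difference from Example 1 concrete, here is a small Python sketch (illustrative only, reusing the hypothetical `distribute` model from above) of why the non-uniform `dense<[0.0, 1.0, 3.0, 4.0, ...]>` constant cannot be hoisted as a single per-lane constant: different lanes own different elements, so the value stays inside the warp region until something (e.g. a broadcast plus per-lane extract) materializes each lane's slice.

```python
# Model of why the non-uniform constant in Example 2 is not
# hoisted: each lane's vector<1xf32> slice differs, so no single
# per-lane constant is valid for all 32 lanes.
WARP_SIZE = 32

def distribute(warp_vector, lane_id):
    """Return the single-element slice owned by `lane_id` (illustrative)."""
    return warp_vector[lane_id:lane_id + 1]

# The dense<[0.0, 1.0, 3.0, 4.0, ...]> constant: the 4-element
# pattern repeated 8 times to fill the 32-lane warp vector.
pattern = [0.0, 1.0, 3.0, 4.0] * (WARP_SIZE // 4)
slices = [distribute(pattern, lane) for lane in range(WARP_SIZE)]

# Adjacent lanes disagree on their value, unlike the splat case.
assert slices[0] != slices[1]
```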

https://github.com/llvm/llvm-project/pull/154949

