[Mlir-commits] [mlir] [MLIR][XeGPU] Scattered ops sg-to-wi distribution (PR #154949)
Charitha Saumya
llvmlistbot at llvm.org
Mon Aug 25 09:55:28 PDT 2025
charithaintc wrote:
I did some quick testing. My conclusion is that we don't have to care about how the offset is defined; it will be taken care of by the framework (unless it is produced by some op that is not yet supported, in which case we need to add support for that op).
Example 1 (trivially distributable):
```
func.func @lane_dependent_warp_propagate_read(
    %src: memref<1024xf32>, %dest: memref<1024xf32>) {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %laneid = gpu.lane_id
  %r = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
    %2 = arith.constant dense<0.0> : vector<32xf32>
    gpu.yield %2 : vector<32xf32>
  }
  vector.transfer_write %r, %dest[%laneid] : vector<1xf32>, memref<1024xf32>
  return
}
```
To
```
func.func @lane_dependent_warp_propagate_read(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
  %cst = arith.constant dense<0.000000e+00> : vector<1xf32>
  %0 = gpu.lane_id
  vector.transfer_write %cst, %arg1[%0] : vector<1xf32>, memref<1024xf32>
  return
}
```
Example 2 (complicated case):
```
func.func @lane_dependent_warp_propagate_read(
    %src: memref<1024xf32>, %dest: memref<1024xf32>) {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %laneid = gpu.lane_id
  %r = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
    // %2 = arith.constant dense<0.0> : vector<32xf32>
    %2 = arith.constant dense<[0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0]> : vector<32xf32>
    gpu.yield %2 : vector<32xf32>
  }
  vector.transfer_write %r, %dest[%laneid] : vector<1xf32>, memref<1024xf32>
  return
}
```
To
```
func.func @lane_dependent_warp_propagate_read(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
  %cst = arith.constant dense<[0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00]> : vector<32xf32>
  %0 = gpu.lane_id
  %1 = gpu.warp_execute_on_lane_0(%0)[32] -> (vector<1xf32>) {
    gpu.yield %cst : vector<32xf32>
  }
  vector.transfer_write %1, %arg1[%0] : vector<1xf32>, memref<1024xf32>
  return
}
```
I agree that for the complex case broadcasting is indeed needed. But I think that is outside the scope of the gather/scatter distribution patterns; they should not need to care about it.
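To make the broadcasting point concrete, here is a rough sketch (my own illustration, not output produced by the patterns in this PR) of what fully resolving the leftover warp op in Example 2 would amount to: each lane extracts its own element from the warp-wide constant before the per-lane write. I'm using `vector.extract` with a dynamic lane index here only to show the idea.
```
func.func @lane_dependent_warp_propagate_read(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
  // Same warp-wide constant as in the distributed output above.
  %cst = arith.constant dense<[0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00]> : vector<32xf32>
  %laneid = gpu.lane_id
  // Hypothetical broadcast/extract step: each lane picks out its own element
  // of the warp-wide constant. This is what resolving the remaining warp op
  // would require, and it is not something the gather/scatter patterns do.
  %elem = vector.extract %cst[%laneid] : f32 from vector<32xf32>
  %v = vector.broadcast %elem : f32 to vector<1xf32>
  vector.transfer_write %v, %arg1[%laneid] : vector<1xf32>, memref<1024xf32>
  return
}
```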
https://github.com/llvm/llvm-project/pull/154949