[Mlir-commits] [mlir] [MLIR][XeGPU] Scattered ops sg-to-wi distribution (PR #154949)
Charitha Saumya
llvmlistbot at llvm.org
Mon Aug 25 09:55:28 PDT 2025
charithaintc wrote:
I did some quick testing. My conclusion is that we don't have to care about how the offset is defined; it will be taken care of by the framework (unless it is produced by some op that is not yet supported, in which case we need to add support for that op).
Example 1 (trivially distributable):
```
func.func @lane_dependent_warp_propagate_read(
    %src: memref<1024xf32>, %dest: memref<1024xf32>) {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %laneid = gpu.lane_id
  %r = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
    %2 = arith.constant dense<0.0> : vector<32xf32>
    gpu.yield %2 : vector<32xf32>
  }
  vector.transfer_write %r, %dest[%laneid] : vector<1xf32>, memref<1024xf32>
  return
}
```
To
```
func.func @lane_dependent_warp_propagate_read(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
  %cst = arith.constant dense<0.000000e+00> : vector<1xf32>
  %0 = gpu.lane_id
  vector.transfer_write %cst, %arg1[%0] : vector<1xf32>, memref<1024xf32>
  return
}
```
Example 2 (complicated case):
```
func.func @lane_dependent_warp_propagate_read(
    %src: memref<1024xf32>, %dest: memref<1024xf32>) {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %laneid = gpu.lane_id
  %r = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
    // %2 = arith.constant dense<0.0> : vector<32xf32>
    %2 = arith.constant dense<[0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0, 0.0, 1.0, 3.0, 4.0]> : vector<32xf32>
    gpu.yield %2 : vector<32xf32>
  }
  vector.transfer_write %r, %dest[%laneid] : vector<1xf32>, memref<1024xf32>
  return
}
```
To
```
func.func @lane_dependent_warp_propagate_read(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
  %cst = arith.constant dense<[0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00]> : vector<32xf32>
  %0 = gpu.lane_id
  %1 = gpu.warp_execute_on_lane_0(%0)[32] -> (vector<1xf32>) {
    gpu.yield %cst : vector<32xf32>
  }
  vector.transfer_write %1, %arg1[%0] : vector<1xf32>, memref<1024xf32>
  return
}
```
I agree that for the complex case broadcasting is indeed needed. But I think that is outside the scope of the gather/scatter distribution patterns; they should not need to care about it.
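To make the broadcasting point concrete, here is a rough sketch (my own illustration, not output produced by the patterns in this PR) of what fully resolving the leftover warp op in Example 2 would amount to: each lane extracts its own element from the warp-wide constant before the per-lane write. I'm using `vector.extract` with a dynamic lane index here only to show the idea.
```
func.func @lane_dependent_warp_propagate_read(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>) {
  // Same warp-wide constant as in the distributed output above.
  %cst = arith.constant dense<[0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00, 0.000000e+00, 1.000000e+00, 3.000000e+00, 4.000000e+00]> : vector<32xf32>
  %laneid = gpu.lane_id
  // Hypothetical broadcast/extract step: each lane picks out its own element
  // of the warp-wide constant. This is what resolving the remaining warp op
  // would require, and it is not something the gather/scatter patterns do.
  %elem = vector.extract %cst[%laneid] : f32 from vector<32xf32>
  %v = vector.broadcast %elem : f32 to vector<1xf32>
  vector.transfer_write %v, %arg1[%laneid] : vector<1xf32>, memref<1024xf32>
  return
}
```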
https://github.com/llvm/llvm-project/pull/154949