[Mlir-commits] [mlir] [MLIR][XeGPU] Scattered ops sg-to-wi distribution (PR #154949)

Charitha Saumya llvmlistbot at llvm.org
Mon Aug 25 09:00:55 PDT 2025


charithaintc wrote:

> > Instead I think offsets and masks must be distributed.
> 
> This is the main difference between scattered ops and nd ops.
> 
> 1. We do not have an intrinsic beneath these ops, that would provide clear rules (i.e., describe the structure) of a load/store.
> 2. We do not have a single offset that defines a base pointer for a 2D shape whose structure we could describe using a layout attribute.
> 
> **The offsets are the layout**, and they are not necessarily linear (w.r.t. lane id) or compile time defined.
> 
> The documentation does not prevent me from supplying a completely unstructured vector of offsets (e.g., `[0, 5, 2, 11, 1]`), it only says that the op needs SG-size vector of offsets:
> 
> > * `offsets`: represents offsets from source. required if `source` in not a TensorDescType.
> >   offsets is a vector of `index` type and vector length is either the subgroup size
> >   or 1 in SIMT mode. scalar offset is also valid for SIMT mode.
> 
> Therefore, we cannot "distribute" such vector based on `lane_layout = [1, N]`. How would that look for the above unstructured vector? And if we could distribute, why do we need _a vector_ of offsets at SG level?
> 
> The same applies to the mask, one can supply any random vector of `i1`. How do we convey that lane 0 is 0, but lane 3 is 1 in pure SIMT?
> 
> The offsets/mask vectors are not SG-uniform, they are allowed to be unstructured, and they can be completely runtime defined. What is distribution supposed to do with them at compile time, in your opinion?

When I say "offsets are distributed", it does not mean we have to describe them as some affine function of the lane ID. I mean that the vector<16xindex> becomes a vector<1xindex> per lane.

Each lane can then extract the scalar value from this vector<1xindex>. Let me give an example.

Before:
```
%offsets = arith.constant dense<0> : vector<16xindex>
// insert any values into this vector (random or linear, it does not matter)
%v = xegpu.load %base [%offsets] : i64, vector<16xindex> -> vector<16xf16>
```
After SIMT distribution:
```
%offsets = arith.constant dense<0> : vector<1xindex>
// insert any value into this vector (random or linear, it does not matter)
%scalar_offset = vector.extract %offsets[0] : index from vector<1xindex>
%v = xegpu.load %base [%scalar_offset] : i64, index -> vector<1xf16>
```
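The mask mentioned above would be handled the same way under this strategy; a minimal sketch (the mask value and names here are illustrative, not taken from the PR):
```
// Mask distributed analogously (illustrative sketch): the SG-level
// vector<16xi1> becomes vector<1xi1> per lane, and each lane extracts
// its own i1 predicate.
%mask = arith.constant dense<true> : vector<1xi1>
%scalar_mask = vector.extract %mask[0] : i1 from vector<1xi1>
```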

Can you please explain why such a strategy would not work?

If instead we broadcast the offsets, we waste a lot of registers, and broadcasting needs cross-lane communication.
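
For contrast, a rough sketch of what the broadcast alternative would look like (illustrative only; the lane-id indexing is my assumption, not code from the PR):
```
// Broadcast alternative (illustrative): every lane keeps the full SG-size
// offset vector live in registers and picks its own element by lane id.
%lane_id = gpu.lane_id
%offsets = arith.constant dense<0> : vector<16xindex>
%scalar_offset = vector.extract %offsets[%lane_id] : index from vector<16xindex>
%v = xegpu.load %base [%scalar_offset] : i64, index -> vector<1xf16>
```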

https://github.com/llvm/llvm-project/pull/154949
