[Mlir-commits] [mlir] [MLIR][XeGPU] Allow some nd ops to have argument shapes mismatch for … (PR #120566)

Jianhui Li llvmlistbot at llvm.org
Thu Dec 19 16:01:46 PST 2024


Jianhui-Li wrote:

Based on the nature of the tensor descriptor, I don't think we should distribute it.

The tensor descriptor is a uniform value shared by all SIMD lanes.  Only one copy of the tensor descriptor is created for the whole subgroup, and its creation doesn't involve the lane id.  The 2D block load is a collective operation: each lane takes the uniform tensor descriptor and loads/stores its own data fragment.  The block shape inside the tensor descriptor therefore can't be viewed exactly like a memref shape, which distributes naturally with each thread computing its own address from its lane id.  Instead, the tensor descriptor computation involves no lane id, and all lanes compute the same value, so at the assembly level the tensor descriptor creation is effectively done by only one thread.
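To make the shape relationship concrete, here is a rough sketch (op syntax, shapes, and attributes are illustrative and not copied from the patch): the descriptor keeps the full block shape shared by the subgroup, while each lane's loaded vector is only its own fragment, which is the kind of mismatch this PR relaxes the verifier to accept.

```mlir
// Illustrative sketch only; exact XeGPU syntax/attributes may differ.
// One uniform descriptor for the whole 8x16 block; its creation uses no
// lane id, so every lane in the subgroup computes the identical value.
%tdesc = xegpu.create_nd_tdesc %A[%x, %y]
    : memref<128x128xf16> -> !xegpu.tensor_desc<8x16xf16>

// The 2D block load is collective: all 16 lanes pass the same descriptor,
// and each lane receives only its own 8x1 fragment of the 8x16 block.
%frag = xegpu.load_nd %tdesc
    : !xegpu.tensor_desc<8x16xf16> -> vector<8x1xf16>
```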

If we distribute it, we need to reverse that later by merging the shape back to the original block size and removing the lane id from the address computation.  The distribution itself is also non-trivial, which makes the reversal complex: a lane's data fragment may be strided along two dimensions, so each thread may generate multiple addresses.  I don't see that this is worth doing, since I don't see any optimization we want, or are currently missing, on this kind of distributed form.
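Below is a hypothetical sketch of what such a distributed form would have to express (illustrative only, not real XeGPU IR), to show why recombining it into one uniform descriptor is not a trivial rewrite.

```mlir
// Hypothetical distributed addressing (illustrative, not real XeGPU IR):
// each lane would derive its own strided addresses from its lane id.
%lane = gpu.lane_id
// For a 16x16xf16 block split into one 16x1 column per lane, lane L owns
// elements A[x + r][y + L] for r = 0..15: that is 16 strided addresses per
// lane (or a per-lane strided descriptor) instead of one uniform tensor
// descriptor shared by the subgroup.  A reversal pass would have to prove
// these per-lane expressions recombine into a single contiguous block.
```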

>>This approach comes at the price of a vague IR-to-HW-constructs mapping: now the IR no longer represents something a single logical thread in SIMT owns and acts upon, so the abstraction layering is still broken.

To me, SIMT doesn't mean each thread must know exactly how to compute its address from its own lane id.  That is actually what the HW ISA tries to avoid, since per-lane addresses use more registers than one uniform tensor descriptor.

>>This can potentially have some unexpected implications that we don't see today on other transformations and analyses (e.g., nd_load and nd_store are ops that should have memory side effects that are not implemented at the moment. It is unclear to me what implications this violation of "ownership" may have.).

 I don't quite understand the "memory side effects" and "violation of ownership" concerns well enough to debate them.  Maybe an example would help here.
 
 In the worst case, most XeGPU-level optimizations are target dependent, so we would have to handle this inconsistent-shape issue as something target specific anyway.  I don't expect many target-independent optimizations that would require XeGPU to be distributed in a perfect SIMT flavor, and I only see the problems stated above if we go that way.  If you have an example, please point it out.
 


https://github.com/llvm/llvm-project/pull/120566

