[Mlir-commits] [mlir] [MLIR][XeGPU] Add uArch limitation to scatter load store (PR #172845)
Artem Kroviakov
llvmlistbot at llvm.org
Thu Jan 22 04:22:07 PST 2026
================
@@ -962,41 +962,77 @@ void LayoutInfoPropagation::visitLoadGatherOp(
LayoutInfo loadLayout;
LayoutInfo maskLayout;
+ auto uArch = getUArch(getChipStr(load).value_or(""));
+ const int subgroupSize = uArch->getSubgroupSize();
xegpu::DistributeLayoutAttr anchorLayout = load.getLayoutAttr();
if (hasParamsOfLayoutKind(anchorLayout)) {
loadLayout = LayoutInfo(anchorLayout);
- maskLayout = loadLayout;
} else {
+ LayoutInfo valueLayout = results[0]->getValue();
+ // Need the layout of the value to propagate to the tensor descriptor.
+ if (!valueLayout.isAssigned())
+ return;
+
+ auto resAttr = dyn_cast<xegpu::DistributeLayoutAttr>(valueLayout.get());
+ auto instDataIncoming = resAttr.getEffectiveInstDataAsInt();
+ if (auto sliceAttr = dyn_cast<xegpu::SliceAttr>(resAttr))
+ instDataIncoming = SmallVector<int64_t>(
+ cast<xegpu::LayoutAttr>(sliceAttr.flatten().getParent())
+ .getInstData()
+ .asArrayRef());
- // The layout is strictly determined by the payload type.
VectorType payloadTy = load.getValueType();
if (!payloadTy) {
load.emitWarning("Not propagating, non-vector payload supplied.");
return;
}
- auto uArch = getUArch(getChipStr(load).value_or(""));
- const int subgroupSize = uArch->getSubgroupSize();
- SmallVector<int> instData{subgroupSize};
- if (auto chunkSize = load.getChunkSize().value_or(0); chunkSize > 1)
- instData.push_back(chunkSize);
- else if (auto srcTdescTy =
- dyn_cast<xegpu::TensorDescType>(load.getSourceType())) {
- if (srcTdescTy.getChunkSizeAsInt() > 1)
- instData.push_back(chunkSize);
- }
+ const auto *uArchInstruction =
+ dyn_cast<xegpu::uArch::LoadGatherInstruction>(
+ uArch->getInstruction(xegpu::uArch::InstructionKind::LoadGather));
- if (layoutKind == LayoutKind::InstData)
- loadLayout =
- LayoutInfo(xegpu::LayoutAttr::get(load.getContext(), instData));
- else
- loadLayout = getSIMTLayoutInforForScatterIO(
- payloadTy, uArch, uArch->getGeneralPackedFormatBitSize());
+ // Check if value inst_data complies with uArch
+ if (layoutKind == LayoutKind::InstData) {
+ // Each lane loads either one element
+ SmallVector<int> instDataUarch{subgroupSize};
+ // Or multiple elements as 2D with lane's elements in the inner dimension
+ if (payloadTy.getRank() != 1) {
+ if (payloadTy.getRank() != 2) {
+ load.emitWarning("Expected 2D payload for LoadGatherOp.");
+ return;
+ }
+ instDataUarch.push_back(
+ (std::min(static_cast<int>(payloadTy.getShape().back()),
+ uArchInstruction->getMaxLaneLoadStoreSize())));
+ }
+ // If inst data does not match, enforce the uArch-based one
+ if (!llvm::equal(instDataIncoming, instDataUarch)) {
+ xegpu::LayoutAttr sourceAttr = dyn_cast<xegpu::LayoutAttr>(resAttr);
+ if (auto sliceAttr = dyn_cast<xegpu::SliceAttr>(resAttr)) {
+ sourceAttr = cast<xegpu::LayoutAttr>(sliceAttr.flatten().getParent());
+ }
+ assert(sourceAttr);
+ xegpu::DistributeLayoutAttr updatedLayoutAttr = xegpu::LayoutAttr::get(
+ load.getContext(), sourceAttr.getSgLayout(), sourceAttr.getSgData(),
+ DenseI32ArrayAttr::get(load.getContext(), instDataUarch),
+ sourceAttr.getLaneLayout(), sourceAttr.getLaneData(),
+ sourceAttr.getOrder());
+
+ if (auto sliceAttr = dyn_cast<xegpu::SliceAttr>(resAttr))
+ updatedLayoutAttr = xegpu::SliceAttr::get(
+ load.getContext(), updatedLayoutAttr, sliceAttr.getDims());
+ valueLayout = LayoutInfo(updatedLayoutAttr);
+ }
+ }
+ loadLayout = valueLayout;
+ load.setLayoutAttr(dyn_cast<xegpu::DistributeLayoutAttr>(loadLayout.get()));
+ }
- // Mask operand should have 1D default layout.
+ if (layoutKind == LayoutKind::InstData)
+ maskLayout =
----------------
akroviakov wrote:
> mask/offset should be inferred from the loadLayout
The [doc](https://mlir.llvm.org/docs/Dialects/XeGPU/#xegpustore-xegpustorescatterop) says:
> mask is a vector of size equal to the subgroup size, or 1 at lane level.
> offsets is a vector of index type and vector length is either the subgroup size or 1 at lane level
So, after blocking, each `inst_data`-sized subgroup-level instruction should use a vector of subgroup size for both the mask and the offsets.
The doc also has an example, `Example 2 (Subgroup level):`, that shows `16x8` data with `layout = #xegpu.layout<lane_layout = [16, 1], lane_data = [1, 8]>` and a `16xi1` mask. Would it be correct to simply reuse the given data layout for the mask as well?
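
To make the shape reasoning concrete, here is a minimal standalone sketch of the `inst_data` rule from the diff above and of the 1D mask/offsets shape the doc describes. It deliberately does not use MLIR types: the helper names (`scatterInstData`, `maskOffsetInstData`, `numBlockedInstructions`) are hypothetical and not part of the PR, and the per-lane limit passed in is an arbitrary illustrative value, not a real uArch number.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in the PR): inst_data for the gathered value.
// Rank 1: each lane handles one element, so inst_data = [subgroupSize].
// Rank 2: the inner dimension holds a lane's elements, capped by the
// per-lane load/store limit. Other ranks yield an empty vector here
// (the pass in the diff emits a warning and returns instead).
std::vector<int64_t> scatterInstData(const std::vector<int64_t> &payloadShape,
                                     int64_t subgroupSize,
                                     int64_t maxLaneLoadStoreSize) {
  if (payloadShape.size() != 1 && payloadShape.size() != 2)
    return {};
  std::vector<int64_t> instData{subgroupSize};
  if (payloadShape.size() == 2)
    instData.push_back(std::min(payloadShape.back(), maxLaneLoadStoreSize));
  return instData;
}

// Hypothetical helper: per the doc, mask and offsets of one subgroup-level
// instruction are 1D vectors of subgroup size, whatever the chunk size is.
std::vector<int64_t> maskOffsetInstData(int64_t subgroupSize) {
  return {subgroupSize};
}

// Hypothetical helper: number of instructions produced by blocking `shape`
// with `instData`. Assumes equal ranks and exact divisibility.
int64_t numBlockedInstructions(const std::vector<int64_t> &shape,
                               const std::vector<int64_t> &instData) {
  assert(shape.size() == instData.size() && "rank mismatch");
  int64_t count = 1;
  for (std::size_t i = 0; i < shape.size(); ++i) {
    assert(instData[i] > 0 && shape[i] % instData[i] == 0 &&
           "shape must be divisible by inst_data");
    count *= shape[i] / instData[i];
  }
  return count;
}
```

For a `32x16` payload with subgroup size 16 and an assumed per-lane limit of 8, this yields `inst_data = [16, 8]` and four blocked instructions, each of which would take a 16-element mask and a 16-element offsets vector under the reading above.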
https://github.com/llvm/llvm-project/pull/172845