[llvm] [AMDGPUInstCombineIntrinsic] Do not narrow amdgcn_s_buffer_load intrinsic to 8,16-bit variants (PR #117997)

Mon Dec 16 07:15:41 PST 2024

jmmartinez wrote:

> > Do not narrow 8,16-bit amdgcn_s_buffer_load instrinsics
> 
> The wording is a bit strange since it would not make sense to narrow an 8-bit load anyway.
> 
> Typo "instrinsic".

Argh! Sorry, the description was completely backwards. I've fixed these in the commit.

> Why is this only for s_buffer_load, not VMEM buffer_load?

Good question. I haven't seen any mention of the dword granularity of the checks in the documentation, and the behavior for sizes smaller than a dword isn't clear either.

The GCN gen3 manual says (in the notes of section 8.1.5.1):
> c. Load/store-format-* instruction and atomics are range-checked "all or nothing" -- either entirely in or out.
> d. Load/store-dword-x{2,3,4} and range-check per component.

--> I assume that the "and" in _d._ is a typo and should be "are".

The "all or nothing" range check might apply to any size of `(load/store)-format-*`, so we should stop narrowing these as well.
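To make the difference between notes _c._ and _d._ concrete, here is a rough sketch of the two range-check modes. It is a toy model and not taken from the manual: the names `RangeCheck` and `modelLoad` are hypothetical, the buffer is modeled as a vector of valid dwords, and out-of-range components are assumed to read as zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model (an assumption, not hardware documentation): `buffer`
// holds the valid dwords, and the load reads `count` dwords starting at
// dword index `first`. Out-of-range components are assumed to read as zero.
enum class RangeCheck { AllOrNothing, PerComponent };

std::vector<uint32_t> modelLoad(const std::vector<uint32_t> &buffer,
                                std::size_t first, std::size_t count,
                                RangeCheck mode) {
  std::vector<uint32_t> result(count, 0);
  const bool wholeLoadInRange = first + count <= buffer.size();
  // "All or nothing": one out-of-range component invalidates the whole load.
  if (mode == RangeCheck::AllOrNothing && !wholeLoadInRange)
    return result;
  // Otherwise each component is checked on its own.
  for (std::size_t i = 0; i < count; ++i)
    if (first + i < buffer.size())
      result[i] = buffer[first + i];
  return result;
}
```

In this model, with a buffer of two dwords `{10, 20}`, a four-dword "all or nothing" load yields 0 for component 0, while the same load narrowed to one dword yields 10. Under the per-component check both yield 10, which is why narrowing would only be safe in that mode.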

> > We can still narrow this:
> > ```assembly
> >   %data = call <4 x half> @llvm.amdgcn.s.buffer.load.v4f16(<4 x i32> %rsrc, i32 %ofs, i32 0)
> >   %elt1 = extractelement <4 x half> %data, i32 0
> >   ret half %elt1
> > ```
> > 
> > Into this (narrowing the load from <4 x half> to <2 x half> and keeping the extractelement):
> > ```assembly
> >   %data = call <2 x half> @llvm.amdgcn.s.buffer.load.v2f16(<4 x i32> %rsrc, i32 %ofs, i32 0)
> >   %elt1 = extractelement <2 x half> %data, i32 0
> >   ret half %elt1
> > ```
> 
> Are you saying that narrowing is OK if the offset does not need to be updated? Does your patch implement that, or is it a future improvement?

Narrowing should be OK if the extracted elements are all in the same dword, but since I'm not 100% sure how the dword granularity works in detail, I have not implemented this transformation.
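For reference, the "same dword" condition could be checked along these lines. This is only a sketch and not part of the patch: `demandedEltsShareDword` is a hypothetical standalone helper, and it assumes the demanded elements are given as a bitmask with bit `i` set when element `i` is used.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the patch: returns true when every
// demanded element lies in the same 32-bit dword of the loaded vector.
// Bit i of `demandedElts` marks element i as used; `eltBits` is the element
// width in bits (for example 8, 16 or 32).
bool demandedEltsShareDword(uint64_t demandedElts, unsigned eltBits) {
  if (demandedElts == 0)
    return false; // nothing is demanded, so there is nothing to narrow to
  bool seen = false;
  unsigned dword = 0;
  for (unsigned i = 0; i < 64; ++i) {
    if (!(demandedElts & (uint64_t{1} << i)))
      continue;
    // Index of the dword that contains element i.
    unsigned thisDword = (i * eltBits) / 32;
    if (seen && thisDword != dword)
      return false;
    dword = thisDword;
    seen = true;
  }
  return true;
}
```

For `<4 x half>`, extracting only element 0 (mask `0b0001`) passes the check, as in the quoted example, whereas extracting elements 0 and 2 (mask `0b0101`) does not. If the shared dword is not the first one, the offset would presumably need updating too, which this sketch does not cover.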

https://github.com/llvm/llvm-project/pull/117997