[llvm] [AMDGPU] Optimize image sample followed by llvm.amdgcn.cvt.pkrtz into d16 variant (PR #145203)

Fri Jul 4 01:22:59 PDT 2025

================
@@ -247,6 +247,42 @@ simplifyAMDGCNImageIntrinsic(const GCNSubtarget *ST,
                                        ArgTys[0] = User->getType();
                                      });
         }
+
+        // Fold image.sample + cvt.pkrtz -> extractelement idx0 into a single
+        // d16 image sample.
+        // Pattern to match:
+        //   %sample = call float @llvm.amdgcn.image.sample...
+        //   %pack = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %sample,
+        //   float %any)
+        //   %low = extractelement <2 x half> %pack, i64 0
+        // Replacement:
+        //   call half @llvm.amdgcn.image.sample
----------------
harrisonGPU wrote:

Thanks, Jay. I’ve thought about the cases you mentioned.
> What if both inputs of cvt.pkrtz come from image.sample instructions?

Did you mean something like this?
```llvm
define amdgpu_ps float @image_sample_2d_single_pkrtz_two_sample_no_d16(<8 x i32> %surf_desc, <4 x i32> %samp, float %u, float %v) {
entry:
  %sample1 = call float @llvm.amdgcn.image.sample.lz.2d.f32.f32.v8i32.v4i32(i32 2, float %u, float %v, <8 x i32> %surf_desc, <4 x i32> %samp, i1 false, i32 0, i32 0)
  %sample2 = call float @llvm.amdgcn.image.sample.lz.2d.f32.f32.v8i32.v4i32(i32 2, float %u, float %v, <8 x i32> %surf_desc, <4 x i32> %samp, i1 false, i32 0, i32 0)
  %pack = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %sample1, float %sample2)
  %h0 = extractelement <2 x half> %pack, i64 0
  %h1 = extractelement <2 x half> %pack, i64 1
  %mul = fmul half %h0, %h1
  %div = fdiv half %mul, %h0
  %add = fadd half %div, %h1
  %res = fpext half %add to float
  ret float %res
}
```
In practice, though, LLPC only uses the low half. You can find the details in the LLPC source:
```cpp
m_builder->CreateFpTruncWithRounding(inst->getOperand(0),.....
```
so the second operand is usually a constant 0.0.
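
For reference, here is a minimal sketch of the single-sample case this fold is meant to handle, based on the pattern described in the diff comment. The function names `@sketch_before` and `@sketch_after` and the `dmask` value of 1 are illustrative assumptions, not taken from the patch's tests. Intrinsic declarations are omitted, as in the examples in this comment.

```llvm
; Before the fold: an f32 sample feeds cvt.pkrtz, and only lane 0 is extracted.
define amdgpu_ps float @sketch_before(<8 x i32> %surf_desc, <4 x i32> %samp, float %u, float %v) {
entry:
  %sample = call float @llvm.amdgcn.image.sample.lz.2d.f32.f32.v8i32.v4i32(i32 1, float %u, float %v, <8 x i32> %surf_desc, <4 x i32> %samp, i1 false, i32 0, i32 0)
  %pack = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %sample, float 0.000000e+00)
  %low = extractelement <2 x half> %pack, i64 0
  %res = fpext half %low to float
  ret float %res
}

; Intended result (assumed shape): the sample returns half directly,
; and the cvt.pkrtz and extractelement are gone.
define amdgpu_ps float @sketch_after(<8 x i32> %surf_desc, <4 x i32> %samp, float %u, float %v) {
entry:
  %sample = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float %u, float %v, <8 x i32> %surf_desc, <4 x i32> %samp, i1 false, i32 0, i32 0)
  %res = fpext half %sample to float
  ret float %res
}
```

In the two-sample example above, `%h1` is also extracted from `%pack`, so the pack would not match this single-use, low-lane-only shape.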

> What if image.sample returns <2 x float> or <4 x float> and all values are converted to f16?

It’s a similar case. Only the first result of each pkrtz is used. For example:
```llvm
define amdgpu_ps float @image_sample_2d_single_pkrtz_d16(<8 x i32> %surf_desc, <4 x i32> %samp, i32 %u, i32 %v) {
entry:
  %0 = call reassoc arcp contract afn <4 x float> @llvm.amdgcn.image.load.2d.v4f32.i32.v8i32(i32 15, i32 %u, i32 %v, <8 x i32> %surf_desc, i32 0, i32 0)
  %1 = extractelement <4 x float> %0, i64 3
  %2 = extractelement <4 x float> %0, i64 2
  %3 = extractelement <4 x float> %0, i64 1
  %4 = extractelement <4 x float> %0, i64 0
  %5 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %4, float 0.000000e+00)
  %6 = extractelement <2 x half> %5, i64 0
  %7 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %3, float 0.000000e+00)
  %8 = extractelement <2 x half> %7, i64 0
  %9 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %2, float 0.000000e+00)
  %10 = extractelement <2 x half> %9, i64 0
  %11 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %1, float 0.000000e+00)
  %12 = extractelement <2 x half> %11, i64 0
  %mul1 = fmul reassoc arcp contract afn half %6, %8
  %mul2 = fmul reassoc arcp contract afn half %10, %12
  %add = fadd reassoc arcp contract afn half %mul1, %mul2
  %res = fpext half %add to float
  ret float %res
}
```
I plan to support this case in a follow-up patch after some refactoring: https://github.com/llvm/llvm-project/pull/145312#issuecomment-2996144139.

Do we have a `@llvm.amdgcn.cvt.rtz(float)` intrinsic? `@llvm.amdgcn.cvt.pkrtz` always returns `<2 x half>`, so it can't be used directly to produce a scalar half value, right?


https://github.com/llvm/llvm-project/pull/145203