[llvm] [AMDGPU][True16][CodeGen] Support AND/OR/XOR and LDEXP True16 format (PR #102620)

Mon Aug 12 10:58:04 PDT 2024

================
@@ -236,5 +243,38 @@ bool GCNPreRAOptimizations::runOnMachineFunction(MachineFunction &MF) {
     Changed |= processReg(Reg);
   }
 
+  if (!ST.useRealTrue16Insts())
+    return Changed;
+
+  // Add RA hints to improve True16 COPY elimination.
----------------
Sisyph wrote:

Definitely a separate change.

I would say the core problem is to teach the compiler that a COPY of a 16-bit value from a 32-bit register to a lo-half 16-bit register is free, while a COPY to a hi-half 16-bit register is not. The allocation order of 16-bit registers is vgpr0lo16, vgpr0hi16, vgpr1lo16, vgpr1hi16, vgpr2lo16, and so on. We prefer (essentially require) that allocation order because it uses the minimum number of registers. But when 16-bit data passes between 16-bit and 32-bit instructions, you get lots of COPYs unless something is done about them. For example, the calling convention puts 16-bit values only into the low halves of vgprs, which causes lots of COPYs to the hi half.
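
To make the allocation order and the COPY cost concrete, here is a small standalone sketch. It is a toy model, not LLVM code: `Half16`, `nthInAllocationOrder`, `copyFrom32IsFree`, and `vgprsUsed` are hypothetical names, and treating "free" as "the destination is the lo half of the same 32-bit register" is my reading of the problem, not something the patch defines.

```cpp
#include <cassert>
#include <string>

// Hypothetical model: a 16-bit register is one half of a 32-bit VGPR.
struct Half16 {
  unsigned Vgpr; // index of the containing 32-bit VGPR
  bool IsHi;     // true for the hi half, false for the lo half
};

// Allocation order: vgpr0lo16, vgpr0hi16, vgpr1lo16, vgpr1hi16, ...
Half16 nthInAllocationOrder(unsigned N) { return {N / 2, (N % 2) != 0}; }

std::string name(Half16 R) {
  return "vgpr" + std::to_string(R.Vgpr) + (R.IsHi ? "hi16" : "lo16");
}

// Assumption: a 16-bit value held in 32-bit register SrcVgpr sits in its
// low half, so the COPY costs nothing only when the destination is the
// lo half of that same VGPR. Any hi-half destination needs a real move.
bool copyFrom32IsFree(unsigned SrcVgpr, Half16 Dst) {
  return Dst.Vgpr == SrcVgpr && !Dst.IsHi;
}

// Number of 32-bit VGPRs covered by the first N 16-bit registers in the
// order above; packing both halves is what keeps this minimal.
unsigned vgprsUsed(unsigned N) { return (N + 1) / 2; }

// Small usage check of the model.
void checkModel() {
  assert(name(nthInAllocationOrder(0)) == "vgpr0lo16");
  assert(name(nthInAllocationOrder(1)) == "vgpr0hi16");
  assert(copyFrom32IsFree(0, nthInAllocationOrder(0)));
  assert(!copyFrom32IsFree(1, nthInAllocationOrder(1)));
}
```

In this model, the second 16-bit value allocated lands in vgpr0hi16, so if it arrives in the low half of a 32-bit register (as with the calling convention), the COPY cannot be free. That is the case the RA hints are meant to reduce.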

I believe it is worth exploring an improvement to coalescing, both to cover additional cases and perhaps as an alternative to the RA hints. I don't know how much work that would entail. In the interest of prioritizing upstreaming, I would suggest committing the RA hints, then removing them later if a coalescing change can do better.

https://github.com/llvm/llvm-project/pull/102620