[clang] [llvm] [AMDGPU] Add global_load_tr for GFX12 (PR #77772)

Fri Jan 12 10:38:29 PST 2024

================
@@ -18178,6 +18178,51 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Function *F = CGM.getIntrinsic(IID, {ArgTy});
     return Builder.CreateCall(F, {Addr, Val, ZeroI32, ZeroI32, ZeroI1});
   }
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b64_v2i32:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b64_i32:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v8i16:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v8f16:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v4i16:
+  case AMDGPU::BI__builtin_amdgcn_global_load_tr_b128_v4f16: {
+
+    Intrinsic::ID IID;
+    llvm::Type *ArgTy;
+    switch (BuiltinID) {
+    case AMDGPU::BI__builtin_amdgcn_global_load_tr_b64_v2i32:
+      ArgTy = llvm::FixedVectorType::get(
+          llvm::Type::getInt32Ty(getLLVMContext()), 2);
+      IID = Intrinsic::amdgcn_global_load_tr_b64;
----------------
changpeng wrote:

> Initially I thought it was better to have _b64/_b128 explicit to avoid confusion, as the number of bits loaded also depends on wave size. On second thought, I believe that having just one intrinsic would be cleaner - will make an update.

This doesn't work once we have instructions that transpose to vectors of B8, B6 and B4. With the current workaround of returning (2 x i32) for them, a single intrinsic overloaded only on the return type could not tell these variants apart.
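
To make the concern concrete, here is a minimal standalone sketch (not LLVM code) of what happens if the variant is selected purely by return type. The `TrVariant` struct, the helper functions, and the `tr_b8`/`tr_b6`/`tr_b4` labels are hypothetical and only for illustration. It assumes all three narrow variants would reuse (2 x i32) as the workaround type.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one transpose-load variant.
struct TrVariant {
  std::string Label; // illustrative label only, not a real intrinsic name
  std::string RetTy; // IR type used to carry the loaded value
};

// Group variants by return type, the only key a single overloaded
// intrinsic would have available for telling them apart.
std::map<std::string, std::vector<std::string>>
groupByReturnType(const std::vector<TrVariant> &Variants) {
  std::map<std::string, std::vector<std::string>> Groups;
  for (const TrVariant &V : Variants)
    Groups[V.RetTy].push_back(V.Label);
  return Groups;
}

// True if at least one return type is shared by more than one variant.
bool hasAmbiguousOverload(const std::vector<TrVariant> &Variants) {
  for (const auto &Entry : groupByReturnType(Variants))
    if (Entry.second.size() > 1)
      return true;
  return false;
}

// The b64/b128 rows mirror types that appear in the diff; the B8/B6/B4
// rows are ASSUMED to use <2 x i32> as the workaround type.
std::vector<TrVariant> exampleVariants() {
  return {
      {"tr_b64", "<2 x i32>"},
      {"tr_b128", "<8 x i16>"},
      {"tr_b8 (hypothetical)", "<2 x i32>"},
      {"tr_b6 (hypothetical)", "<2 x i32>"},
      {"tr_b4 (hypothetical)", "<2 x i32>"},
  };
}

// Small self-check: under the assumptions above, the type alone is ambiguous.
void selfCheck() { assert(hasAmbiguousOverload(exampleVariants())); }
```

Under these assumptions, four variants end up under the same (2 x i32) key, which is why the explicit _b64/_b128 suffix in the intrinsic name still carries information that the type does not.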

https://github.com/llvm/llvm-project/pull/77772