[PATCH] D104847: [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA and MMA instructions

Steffen Larsen via Phabricator via cfe-commits <cfe-commits@lists.llvm.org>
Mon Jul 12 04:22:10 PDT 2021


steffenlarsen added a comment.

Sorry for the late response. It looks like you have handled these issues, and more, in your patch. Thank you for fixing my blunders. 😄



================
Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:727
 TARGET_BUILTIN(__bmma_m8n8k128_ld_c, "vi*iC*UiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__bmma_m8n8k128_mma_xor_popc_b1, "vi*iC*iC*iC*Ii", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__bmma_m8n8k128_st_c_i32, "vi*iC*UiIi", "", AND(SM_75,PTX63))
----------------
tra wrote:
> Bummer. mma.h in CUDA-11.3 still does not compile for Ampere.
> 
> We appear to be missing the new `__bmma_m8n8k128_mma_and_popc_b1` builtin for the `.and` variant of 1-bit `mma`, which was introduced in PTX 7.1 and is not included in this patch.
> 
> https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma
> 
> Do you, by any chance, have an upcoming patch for PTX 7.1, too? :-)
Haha, I hadn't thought of that one. Sadly, we have no plans to extend support to PTX 7.1+ in the next couple of months, but it looks like your new patch (D105384) takes care of it anyway. 😆
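For reference, if the missing `.and` builtin mirrors the `xor` variant quoted above, its definition would presumably look like the sketch below. The prototype string is copied from the xor variant; the SM_80/PTX71 gating (and the PTX71 feature macro itself) is an assumption based on the PTX 7.1 note.

  // Assumed sketch: prototype mirrors the xor variant above; gating per the PTX 7.1 / sm_80 note.
  TARGET_BUILTIN(__bmma_m8n8k128_mma_and_popc_b1, "vi*iC*iC*iC*Ii", "", AND(SM_80,PTX71))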


================
Comment at: clang/test/CodeGen/builtins-nvptx-mma.cu:781-786
+  // CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32
+  // expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_c' needs target feature (sm_80{{.*}},(ptx70{{.*}}}}
+  __mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 1);
+  // CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.c.row.stride.f32
+  // expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_c' needs target feature (sm_80{{.*}},(ptx70{{.*}}}}
+  __mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 0);
----------------
tra wrote:
> tra wrote:
> > This looks rather odd. We're calling a `tf32` builtin, but expect to see an `f32` load intrinsic. Is that expected?
> > 
> > 
> Never mind. I think I understand what's going on now.
> CUDA headers use `__mma_tf32` builtins. `A` and `B` operate on opaque integer types; `C` and `D` operate on floats.
> However, on the PTX front we have `wmma.load.{a,b}...tf32` but `wmma.load.c...f32`.
> 
> I guess it does make sense to keep LLVM intrinsic names close to the instructions they produce.
> 
> 
> 
Yeah, it was definitely confusing to write. I think the current state is the best solution, as it prioritizes consistency within each sub-project. I'm not a big fan of the inconsistency between them, but if we want to follow CUDA's example I suppose we're stuck with it.
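To spell out the asymmetry with the builtins from this patch (the `ld_a` lowering is inferred from the pattern described above; the pointer types follow the opaque-int/float split):

  // a/b fragments: a tf32 builtin lowers to a tf32 load intrinsic (opaque int pointers).
  __mma_tf32_m16n16k8_ld_a(dst, src, ldm, 0);
  //   -> @llvm.nvvm.wmma.m16n16k8.load.a.row.stride.tf32
  // c/d fragments: a tf32 builtin lowers to an f32 load intrinsic (float pointers).
  __mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 0);
  //   -> @llvm.nvvm.wmma.m16n16k8.load.c.row.stride.f32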


================
Comment at: clang/test/CodeGen/builtins-nvptx-mma.py:74
+          make_ldst_ops(["m8n8k4"], ["a", "b", "c", "d"], ["f64"]) +
+          make_ldst_ops(["m16n16k8"], ["a", "b"], ["tf32"]) +
+          make_ldst_ops(["m16n16k8"], ["c", "d"], ["f32"]))
----------------
tra wrote:
> This does not seem to match the generated `builtins-nvptx-mma.cu`, which does have `__mma_tf32_m16n16k8_ld_c`.
> If I regenerate the test I see a somewhat different set of tests, possibly related to the oddity I've pointed out in the generated test changes in this patch.
You are absolutely right. That's a mistake on my part. Looks like you've got it under control in your patch. Thanks!
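For the archive, the root cause is the name split discussed above: the generator keys on the intrinsic element type (`f32` for `c`/`d`), while the emitted builtin name keeps the `tf32` prefix. A minimal sketch of that distinction, using a hypothetical helper name rather than the script's actual internals:

  # Hypothetical sketch: c/d keep the f32 intrinsic type, but the emitted
  # builtin name for the m16n16k8 geometry still carries the tf32 prefix.
  def builtin_type_prefix(geom, ptx_type):
      if geom == "m16n16k8" and ptx_type == "f32":
          return "tf32"  # e.g. __mma_tf32_m16n16k8_ld_c -> *.load.c.*.f32
      return ptx_type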


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104847/new/

https://reviews.llvm.org/D104847


