[PATCH] D19990: [CUDA] Implement __ldg using intrinsics.

Thu May 5 13:43:19 PDT 2016

jlebar added inline comments.

================
Comment at: include/clang/Basic/BuiltinsNVPTX.def:569-603
@@ -568,1 +568,37 @@
 
+// __ldg.  This is not implemented as a builtin by nvcc.
+BUILTIN(__nvvm_ldg_c, "ccC*", "")
+BUILTIN(__nvvm_ldg_s, "ssC*", "")
+BUILTIN(__nvvm_ldg_i, "iiC*", "")
+BUILTIN(__nvvm_ldg_l, "LiLiC*", "")
+BUILTIN(__nvvm_ldg_ll, "LLiLLiC*", "")
+
+BUILTIN(__nvvm_ldg_uc, "UcUcC*", "")
+BUILTIN(__nvvm_ldg_us, "UsUsC*", "")
+BUILTIN(__nvvm_ldg_ui, "UiUiC*", "")
+BUILTIN(__nvvm_ldg_ul, "ULiULiC*", "")
+BUILTIN(__nvvm_ldg_ull, "ULLiULLiC*", "")
+
+BUILTIN(__nvvm_ldg_f, "ffC*", "")
+BUILTIN(__nvvm_ldg_d, "ddC*", "")
+
+BUILTIN(__nvvm_ldg_c2, "E2cE2cC*", "")
+BUILTIN(__nvvm_ldg_c4, "E4cE4cC*", "")
+BUILTIN(__nvvm_ldg_s2, "E2sE2sC*", "")
+BUILTIN(__nvvm_ldg_s4, "E4sE4sC*", "")
+BUILTIN(__nvvm_ldg_i2, "E2iE2iC*", "")
+BUILTIN(__nvvm_ldg_i4, "E4iE4iC*", "")
+BUILTIN(__nvvm_ldg_ll2, "E2LLiE2LLiC*", "")
+
+BUILTIN(__nvvm_ldg_uc2, "E2UcE2UcC*", "")
+BUILTIN(__nvvm_ldg_uc4, "E4UcE4UcC*", "")
+BUILTIN(__nvvm_ldg_us2, "E2UsE2UsC*", "")
+BUILTIN(__nvvm_ldg_us4, "E4UsE4UsC*", "")
+BUILTIN(__nvvm_ldg_ui2, "E2UiE2UiC*", "")
+BUILTIN(__nvvm_ldg_ui4, "E4UiE4UiC*", "")
+BUILTIN(__nvvm_ldg_ull2, "E2ULLiE2ULLiC*", "")
+
+BUILTIN(__nvvm_ldg_f2, "E2fE2fC*", "")
+BUILTIN(__nvvm_ldg_f4, "E4fE4fC*", "")
+BUILTIN(__nvvm_ldg_d2, "E2dE2dC*", "")
+
----------------
jlebar wrote:
> majnemer wrote:
> > Would it be crazy to instead provide a generic builtin? Would cut down on the number of variants...
> > 
> > `__builtin_add_overflow` is an example of such a builtin.
> Art is going to send you flowers.  :)  He and I just had an argument about this.
> 
> I think this isn't an unreasonable thing to want, but I think it's beneficial to be consistent with our existing API.  So if we offer a generic thing for ldg, it would be nice to have one for atomics above, which are basically the same.
> 
> So I told Art I'd prefer to add it to our list.
Another thing: you really see the benefit of a generic builtin once you hit the combinatorial explosion of all the different kinds of loads.  As-is it's not so bad, but supporting all forms of ld.global.nc means covering four different caching behaviors for each of the types above, and supporting all forms of ld is far worse.

Which is to say, if we're going to do the generic thing, we'd benefit most by making it generic on more than just the types.  But we're not ready to do that; I don't think most of these loads even exist in LLVM at the moment.

http://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld
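
To make the scaling argument concrete, here is a rough host-side sketch of what "generic on more than the types" could look like.  Everything in it is hypothetical: `CachePolicy`, its placeholder enumerators, and `generic_ldg` are not clang builtins or PTX names, and the body is a plain load standing in for whatever a real builtin would lower to.  The only numbers taken from this thread are the 29 typed variants in the hunk above and the four caching behaviors mentioned in the comment.

```cpp
#include <cstddef>

// Counts taken from the discussion: the hunk above lists 29 __nvvm_ldg_*
// builtins, and the comment mentions four caching behaviors.
constexpr std::size_t kTypedVariants = 29;
constexpr std::size_t kCachingBehaviors = 4;
// Number of non-generic builtins needed to cover every (type, caching) pair.
constexpr std::size_t kNonGenericBuiltins = kTypedVariants * kCachingBehaviors;

// Hypothetical policy tag.  The enumerators are placeholders, not the actual
// PTX cache operators.
enum class CachePolicy { Policy0, Policy1, Policy2, Policy3 };

// Hypothetical generic entry point, parameterized on both the caching policy
// and the loaded type.  On the host this is only a plain load; a real builtin
// would pick the matching ld.global.nc form instead.
template <CachePolicy P = CachePolicy::Policy0, typename T>
T generic_ldg(const T *ptr) {
  return *ptr;
}
```

A call such as `generic_ldg<CachePolicy::Policy2>(&x)` deduces the loaded type from the pointer and takes the policy as an explicit template argument, so a single name would stand in for all 29 * 4 = 116 typed, per-policy variants.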


http://reviews.llvm.org/D19990