[PATCH] D110089: [CUDA] Implement experimental support for texture lookups.

Artem Belevich via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Sep 24 16:12:21 PDT 2021


tra added a comment.

In D110089#3021652 <https://reviews.llvm.org/D110089#3021652>, @jlebar wrote:

> Okay, I give up on the phab interface.  It's unreadable with all the existing
> comments and lint errors.

Yeah. The Phabricator experience is not great.

> +// Put all functions into anonymous namespace so they have internal linkage.
>
> Say a little more?  Specifically, you want anon ns because this is device code
> and it has to work even without being linked.
>
> (Also, are you sure that plain `inline` doesn't do the right thing?  Like, we
> have lots of CUDA headers that are `inline`'ed without all being in an anon
> ns.)

We do want inlining, but the main purpose here is to avoid potential ODR violations with -fgpu-rdc, where multiple TUs may be compiled against different versions of this header. Because the hash may change between versions, we could end up with tag types (and fetch functions) that have the same names but mean different things in different TUs.
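
For concreteness, here is a minimal hypothetical sketch of the linkage concern (the names are illustrative, not the actual header contents):

// Sketch only: illustrates the linkage concern, not the real header.
namespace {                          // internal linkage: each TU keeps its own copy
template <int __tag_value> struct __op_tag {};   // tag chosen via the string hash

template <int __tag_value>
__device__ float __fetch_impl();     // definition may differ between header revisions
} // namespace

// With plain `inline` in the global namespace, these entities would have
// external (vague) linkage. If TU A and TU B were built against header
// revisions whose hashes assign different meanings to the same tag value,
// the linker would pick one definition for both TUs and silently break the
// other -- an ODR violation rather than a diagnosable error.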

> +// NOTE: the perfect hashing scheme comes with inherent self-test. If the hash
> +// function has a collision for any of the texture operations, the compilation
> +// will fail due to an attempt to redefine a tag with the same value. If the
> +// header compiles, then the hash function is good enough for the job.
>
> I guess if it has a self-test then that's fine.  Though is this really better
> than a series of `if` statements with strcmp?

Yes. The somewhat obfuscated metaprogramming wins on points here, IMO:

- It's fairly well-structured, even if the macros make it a bit of a pain to dig through.
- The assumptions about the perfect hash are minimal -- it's just a 1:1 string->integer map. If that assumption is violated, we're guaranteed to get a compilation error when we instantiate the templates that map to the same value (see the sketch after this list). I did verify that by deliberately changing the hash function.
- strcmp() would result in 100+ comparisons, which alone would be a pain to write manually. In my experience, having more than a handful of nearly-identical but critically different chunks of code makes the whole thing very error-prone. I tried that approach early on, before I figured out how to parameterize templates by a string.
- We'd also need to use it in a function template, so the code would get replicated across all the instances of the signatures we need to implement. While that's probably not a showstopper, it's still additional IR we'd have to deal with. Adding an incremental burden on all CUDA users is worse than an additional mental burden on whoever may need to read this code (most likely me).
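
Roughly, the scheme looks like the following sketch (the names, hash function, and operation set are illustrative, not the header's actual contents):

// Sketch of the perfect-hash tag scheme; names and hash are illustrative.
constexpr unsigned __op_hash(const char *__s) {
  unsigned __h = 0;
  while (*__s)
    __h = __h * 13u + static_cast<unsigned>(*__s++); // any collision-free hash will do
  return __h;
}

// The primary template is declared but never defined; each operation provides
// an explicit specialization keyed by the hash of its name.
template <unsigned __tag> struct __op_fetch;

// If two operation names ever hashed to the same value, the specializations
// below would collide and the header would fail to compile with a
// redefinition error -- that is the "inherent self-test".
#define __DEFINE_OP(__name)                                                    \
  template <> struct __op_fetch<__op_hash(#__name)> {                          \
    /* the version-specific texture fetch for __name would live here */        \
  }
__DEFINE_OP(tex1D);
__DEFINE_OP(tex2D);
__DEFINE_OP(tex3D);
#undef __DEFINE_OP

Dispatch then happens by selecting the specialization for the operation name at compile time, so there is no runtime strcmp() chain at all.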

> I guess I am scared of this kind of thing because I did it once in ccache.  I thought I was very clever and got
> a good speedup.  1 year later I found out I'd broken handling of __DATE__ and __TIME__.  o.O

Being wary of easy-to-miss errors here, I did exhaustive testing of all variants (2972 of them) of the high-level API calls provided by the NVIDIA headers and verified that we end up generating identical instructions and parameters.
I will add that test to the test-suite, as it needs the actual CUDA headers.

In D110089#3021659 <https://reviews.llvm.org/D110089#3021659>, @jlebar wrote:

> Presumably as a separate commit we should add tests to the test_suite repository to ensure that this at least still compiles with different versions of CUDA?

That's the plan. I've tested the patch manually down to CUDA-9. It will not work with CUDA-8 or older, as they have a completely different under-the-hood implementation in the CUDA headers. I'll add an include guard for the old CUDA versions.
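
Something along these lines (the exact condition and version cutoff here are assumptions, not the final guard):

// Hypothetical version guard; the exact condition is an assumption.
#if defined(__CUDA__) && defined(CUDA_VERSION) && CUDA_VERSION >= 9000
// ... texture lookup support for CUDA-9 and newer ...
#else
// Older CUDA versions use a completely different under-the-hood
// implementation in their headers, so we stay out of the way.
#endif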


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D110089/new/

https://reviews.llvm.org/D110089


