[PATCH] D125904: [Cuda] Use fallback method to mangle externalized decls if no CUID given

Joseph Huber via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue May 24 17:12:08 PDT 2022


jhuber6 added a comment.

In D125904#3535830 <https://reviews.llvm.org/D125904#3535830>, @tra wrote:

> I'm still itching to figure out a way to avoid CUID altogether and with the new driver it may be possible.

I would be 100% in favor of working around this if possible, it's proving to be one of the most painful parts of the process.

> CUID serves two purposes:
> a) avoid name conflicts during device-side linking ("must be globally unique" part)
> b) allow host to refer to something in the GPU executable ("stable within TU" part)
>
> My understanding that we already collect the data about all offloading entities and that include those we have to externalize. We also postpone generation of the registration glue to the final linking step.

Yes, we would have all those entries see here <https://godbolt.org/z/vTjsvY85q>. The final linker just gets a pointer to `__start_omp_offloading_entries` so we can iterate this at runtime.

> Let's suppose that we do not externalize those normally-internal symbols. The offloading table would still have entries for them, but there will be no issue with name conflicts during linking, as they do remain internal.

We would also need to make sure that they're used so they don't get optimized out.

> During the final linking, if an an offloading entity uses a pointer w/o a public symbol, we would be in position to generate a unique one, using the pointer value in the offload table entry. Linker can just use a free-running counter for the suffix, or could just generate a completely new symbol. It does not matter.

This is the part I'm not sure about, how would we generate new symbols during the linking stage? We can only iterate the offloading entry table after the final linking, which is when we're already supposed to have a fully linked and registered module. We could potentially generate the same kind of table for the device, but I don't think `nvlink` would perform the same linker magic to merge those entries.

> When we generate the host-side registration glue, we'll use the name of that generated symbol.

When we make the registration glue we haven't created the final executable, so I don't think we could modify existing entries, only create new ones.

> In the end linking will work exactly as it would for C++ (modulo having offloading tables) and host/device registration will be ensured by telling host side which symbols to use, instead of assuming that we've happened to generate exactly the same unique suffix on both sides.
>
> @yaxunl -- do you see any holes in this approach?

I can't think of a way to generate these new symbols, we'd need to somehow have a list of all the static entries that need new symbols and then modify the object file after its been made. Not sure if this is possible in general considering the vendor linkers might not behave. I'm definitely open to discussion though, I'd love to have a solution for this.



================
Comment at: clang/lib/CodeGen/CodeGenModule.cpp:6836
+
+  // If the CUID is not specified we try to generate a unique postfix.
+  if (getLangOpts().CUID.empty()) {
----------------
tra wrote:
> jhuber6 wrote:
> > jhuber6 wrote:
> > > tra wrote:
> > > > jhuber6 wrote:
> > > > > tra wrote:
> > > > > > > However, [CUID] is not always availible. 
> > > > > > 
> > > > > > The question is -- when and why is it not available? I'm getting the feeling that we're fixing the consequence here, not the root cause.
> > > > > > 
> > > > > > Is there a reason we can't make sure that the driver always generates a cuid for offload subcompilations and error out if it's needed but is not provided?
> > > > > > That would make this fallback unnecessary and would be a more robust approach in general.
> > > > > > 
> > > > > So, I'm more in favor of this approach because it doesn't require extra intervention from the compiler driver, this makes it less convoluted to do split compilation since we don't have an extra arguments. The way I would prefer it, is that we do this implicitly by default without requiring extra thought from the driver, but if it's not good enough we can support the manual `CUID` approach to let the user override it. I think this is a cleaner implementation, and is mostly coming from my support for CUDA in the new driver which currently doesn't implement the CUID as we do with the old driver. Generally I'd prefer things to behave independent of the driver, so we can consider host and device compilation more separately.
> > > > > So, I'm more in favor of this approach because it doesn't require extra intervention from the compiler driver
> > > > 
> > > > We need the driver intervention for any cc1 compilations anyways, so this does not buy us anything.  While you can run a sub-compilation manually with handcrafted cc1 flags, that's not a practical use case. The driver is the ultimate source of cc1 flags.
> > > > 
> > > > > this makes it less convoluted to do split compilation since we don't have an extra arguments.
> > > > 
> > > > For CUDA/HIP sub-compilation should be done with clang --cuda-host-only/--cuda-device-only.  Whether the driver supplies yet another cc1 option, --cuid=... makes no difference to the user launching such sub-compilation. 
> > > > 
> > > > > The way I would prefer it, is that we do this implicitly by default without requiring extra thought from the driver, but if it's not good enough we can support the manual CUID approach to let the user override it.
> > > > 
> > > > I agree that we can come up with something that will almost always work. Possibly even good enough for all practical purposes. However, if a better solution would take comparable effort, it would make sense to do things right and avoid adding technical debt. 
> > > > 
> > > > On the other hand, requiring the driver to supply identical cuid to all sub-compilations appears to be a better approach to me:
> > > > * Driver is the best place to do it, functionally. Driver has access to all user-provided inputs and is in position to guarantee that all subcompilations get the same cuid.
> > > > * Calculating CUID in the driver keeps relevant logic in one place. Doing it in the driver *and* in the codegen 
> > > > * Figuring out what inputs are relevant for calculation of CUID in cc1 invocation is error prone. E.g. we have to guess which cc1 options are relevant or not and is the driver would pass a macro to one subcompilation but not to another, we would end up generating mismatching CUID and would not have any way to notice that. Even when that's not the case, we would need to guess which flags, supplied by the driver, are relevant. At CC1 level that may be somewhat complicated as top-level options may expand to quite a few more cc1 options. E.g. we'll need to take into account `-std=...`, `--cuda-path=`, `-include ...`, `-I` (and other include paths)... All of that does not belong to the codegen.
> > > > 
> > > > The driver is already doing CUID computation, so I do not see any downsides to just letting it do its job, and I do believe it will be a better, and likely less complicated, solution.
> > > > 
> > > > > ... mostly coming from my support for CUDA in the new driver which currently doesn't implement the CUID as we do with the old driver
> > > > 
> > > > Right. That appears to be the key missing piece.
> > > > 
> > > > What are the obstacles for having CUID calculation done in the new driver. It should have all the info it needs. What am I missing?
> > > > 
> > > > For CUDA/HIP sub-compilation should be done with clang --cuda-host-only/--cuda-device-only. Whether the driver supplies yet another cc1 option, --cuid=... makes no difference to the user launching such sub-compilation.
> > > The problem I have with this is that we use the command line to generate the value, so they aren't going to be the same without the user manually specifying it. I guess we could filter out only "relevant" command line flags, maybe that's an option. I just think it's not intuitive for a name mangling scheme to depend on something external, but there's definitely advantages to doing it that way.
> > > 
> > > I can see your point for the Driver handling this stuff. Now that I'm thinking about it I don't think looking at the macros or the other arguments is a sound solution in the first place. Even without that it would work for almost all the same cases just using the file's unique ID. Without that, this solution is guaranteed not to conflict with any other file on the same file system at the time of compilation. This, as we discussed, potentially fails for non-static source trees and compiling the same file twice and linking it. The current CUID implementation fails on the former, this method fails on both.
> > > 
> > > If the CUID didn't exist, the way I would have implemented it would simply be with the File-ID, and have the CUID be a simple marshalling option that lets the user override it to something unique if needed. I personally think that's simpler for 99.99% of cases and has an easy-out in the last 0.01%. Given that it already exists there's some desire to keep it since the work has already been done I understand.
> > > 
> > > > What are the obstacles for having CUID calculation done in the new driver. It should have all the info it needs. What am I missing?
> > > It's less of a difficulty in implementing and more hoping we could make the name mangling more simple and work by default without the driver. 
> > > Also, we may need this support for a single case in OpenMP, and I'd prefer not need to generate the CUID for OpenMP offloading when it's unused the vast majority of the time. Generally I'd prefer if compiling for the host / device was conceptually the same to the user without requiring external values. If we're sold on the CUID method I can go forward with that, but from my perspective what it's buying us is the ability to compile the following
> > > ```
> > > static __device__ int a;
> > > 
> > > #ifdef MACRO
> > >   do_something_with(a);
> > > #else
> > >   do_something_else();
> > > #endif
> > > ```
> > > ```
> > > clang foo.cu -DMACRO -c -o 1.o
> > > clang foo.cu 1.o
> > > ```
> > > 
> > > This is just a tough problem overall, I don't think there's a single perfect solution. Whatever we choose we'll be trading reproducibility for correctness or whatever. You have more seniority in this space so it's your call what you think I should go forward with. 
> > Also, it's incredibly convoluted, but I can think of a way to break even the current CUID for this.
> > ```
> > static __device__ int a;
> > 
> > __device__ int __attribute__((weak)) *a_ref = &a;
> > ```
> > ```
> > $ clang a.cu -c -fgpu-rdc
> > $ mv a.o b.o
> > $ clang a.cu -c -fgpu-rdc
> > $ nvlink a.o b.o -arch=sm_35 -o out.cubin
> > nvlink error   : Multiple definition of '_ZL1a__static__d041026c8e4167e6' in '1.o', first defined in 'a.o'
> > nvlink fatal   : merge_elf failed
> > ```
> > The problem I have with this is that we use the command line to generate the value, so they aren't going to be the same without the user manually specifying it. I guess we could filter out only "relevant" command line flags, maybe that's an option. I just think it's not intuitive for a name mangling scheme to depend on something external, but there's definitely advantages to doing it that way.
> 
> I'm not sure I follow the "they aren't going to be the same without the user manually specifying it." part. Do you mean that CUIDs passed to sub-compilations would not be same?
> If so, why would that be the case? If would be up to the driver to pick the same set of inputs to hash into the cuid. We only case about single compilation. Separately compiling host/device with --cuda-host/device-only makes it two different compilations, which we may or may not provide any guarantees about. In case we don't we can document that it would be up to user to ensure consistency between host/device objects by using explicit --cuid argument. Within single top-level compilation the driver should have no problem picking single cuid value and passing it on to all subcompilations. 
> 
> > I don't think looking at the macros or the other arguments is a sound solution in the first place.
> 
> They are part of the compilation input set, along with include-related options and, likely, options like `-std` that also affect the sources seen by compiler.
> 
> If we have to generate globally-stable cuid within a cc1 compilation, we have to take as much of the relevant input set for the compilation as practical. I believe preprocessor-related options are relevant to existing use patterns. E.g. compiling the same source with different preprocessor definitions does happen. 
> 
> We're dealing with more than one issue here.
> * who/where is responsible for CUID generation:
>   - driver only
>   - CC1 only
>   - driver as the primary source of CUID and CC1 as the fallback.
> 
> * how do we guarantee CUID stability within single TU compilation, while ensuring global uniqueness.
>   - We can guarantee build-wise uniqueness if we delegate CUID generation to the build system which does know about all compilations and can simply enumerate all of them.
>   - We can not generate globally unique CUID strictly within clang, whether by driver or by CC1. In both cases we'll have some chance of collisions and will need a way to deal with them.
>   - Driver can guarantee within-compilation stability by generating CUID once and passing it to CC1 instances.
>   - Generating CUID within CC1 relies on all CC1 instances producing the same CUID value. It's feasible if we can guarantee that all CC1 instances always operate on identical set of inputs taken into account during CUID generation. That is a dependency on implementation details as those inputs would likely depend on what the driver does. Can we make it work? Probably. But why? 
> 
> > It's less of a difficulty in implementing and more hoping we could make the name mangling more simple and work by default without the driver.
> 
> I do not think "without the driver part" (e.g. directly running -cc1)  is a good metric for driving compiler development. It's the driver's explicit purpose to hide the complexity of the actual compiler command line. 
> 
> If you think there's a practical use case of fallback cuid generation in cc1, I would consider it if it were done in parallel with driver-generated CUID during regular top-level compilation. I.e. `clang a.cu` would run `clang -cc1 --cuid.=<driver-provided-cuid>`, but if one runs `clang -cc1` w/o --cuid, one would be generated for them internally. I would still prefer to see a warning for that, because existence of CUID will be something not obvious to the users and it would likely be very easy to end up with mismatched CUIDs used between the host and device compilations.
> 
> I'm not sure I follow the "they aren't going to be the same without the user manually specifying it." part. Do you mean that CUIDs passed to sub-compilations would not be same?
What I mean is just that if the user does this then it's not going to work.
```
clang foo.cu --offload-device-only -c // different arguments give different cuid
clang foo.cu --offload-host-only -c 
```

> I do not think "without the driver part" (e.g. directly running -cc1) is a good metric for driving compiler development. It's the driver's explicit purpose to hide the complexity of the actual compiler command line.
Fair enough, I just figured this option was more straightforward than passing things in from the command line.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D125904/new/

https://reviews.llvm.org/D125904



More information about the cfe-commits mailing list