[PATCH] D82930: [HIP] Fix rocm detection

Artem Belevich via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Jul 7 11:58:35 PDT 2020


tra added inline comments.


================
Comment at: clang/lib/Driver/ToolChains/AMDGPU.cpp:167
+    llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> VersionFile =
+        FS.getBufferForFile(BinPath + "/.hipVersion");
+    if (!VersionFile)
----------------
arsenm wrote:
> tra wrote:
> > arsenm wrote:
> > > yaxunl wrote:
> > > > yaxunl wrote:
> > > > > yaxunl wrote:
> > > > > > tra wrote:
> > > > > > > arsenm wrote:
> > > > > > > > yaxunl wrote:
> > > > > > > > > arsenm wrote:
> > > > > > > > > > arsenm wrote:
> > > > > > > > > > > yaxunl wrote:
> > > > > > > > > > > > arsenm wrote:
> > > > > > > > > > > > > I don't think the detection should fail if this is missing
> > > > > > > > > > > > why? this file is unique to HIP installation. If it is missing, the installation is broken.
> > > > > > > > > > > Because I should be able to use this without a complete hip installation. Without a specified version, it should assume the most modern layout. This will for example break pointing --rocm-path at the device library build directory for library tests
> > > > > > > > > > I also don't see what value checking the version really provides; it may be informative to print it, but I don't think it's useful to derive information from it
> > > > > > > > > what is the directory structure of your most modern layout?
> > > > > > > > /opt/rocm/amdgcn/bitcode/foo.bc
> > > > > > > > 
> > > > > > > > The plan is to remove this and rely on symlinks in the resource directory though
> > > > > > > > I also don't see what value checking the version really provides; it may be informative to print it, but I don't think it's useful to derive information from it
> > > > > > > 
> > > > > > > In CUDA it's used to detect which back-end features to enable (they depend on what's supported by ptxas supplied by that CUDA version). I don't think that would be relevant to AMDGPU as it does not need external dependencies to generate GPU code.
> > > > > > > 
> > > > > > > It may be useful for filesystem layout detection. E.g. where to find bitcode files and what names to expect. This part seems to be relevant for ROCm, but I'm not sure if the layout is closely tied to the version.
> > > > > > > 
> > > > > > We are required to support previous ROCm releases for certain time range. To do that we need to detect HIP version and enable/disable certain HIP features based on HIP version when necessary.
> > > > > > 
> > > > > > Therefore if we have a new directory structure for ROCm, that directory structure should contain a HIP version file so that we can detect HIP version.
> > > > > > 
> > > > > > 
> > > > > Currently clang includes some wrapper headers by default, which does not work with ROCm 3.5 since those device functions are defined in HIP headers of ROCm 3.5. To support ROCm 3.5, we have to disable including those wrapper headers. This is one example why we need detect HIP version.
> > > > I think we need to separate HIP runtime detection and device library detection and also separate hasValidHIPRuntime and hasValidDeviceLibrary. ROCm toolchain only need to make sure hasValidDeviceLibrary but HIP toolchain need both hasValidDeviceLibrary and hasValidHIPRuntime.
> > > Regardless of whether there's a version file or if does anything, I think the absence of one implies do the most modern thing
> > "The most modern thing" is an ever moving target. If the failure modes keep changing it will create a source of new problems every time something changes. Probably not a big issue in this case, but with (eventually) wide enough user base there will be some.
> > 
> > I assume that AMD does not have much of legacy user base yet for compilations with clang, so defaulting to a recent version may be sensible (unlike CUDA where clang had to deal with a lot of existing CUDA users using old CUDA versions). That said, I'd recommend pinning it to a **specific** version, so the expectations remain stable. Then we can document the failure/fallback in a way users can deal with -- `if no version file is found, clang assumes version x.y and expects to find files X, Y and Z in directory A, B, C. You can specify the locations manually with --something-something-X=A or specify the version with --hip-version=3.7`. 
> > 
> > Considering that clang is being used from various derived tools, you may not always have the luxury of having the whole ROCm installation around and need to be able to override the expectations via command line flags.
> > 
> > If we have a way to override the version, then it does not matter all that much which version we pick as a fallback. If the guess is wrong, we can always correct it with a flag.
> Unlike CUDA we're not really chasing some 3rd party thing that periodically drops unpredictable changes. We know what the most modern thing is on tip of tree
> Unlike CUDA we're not really chasing some 3rd party thing that periodically drops unpredictable changes. We know what the most modern thing is on tip of tree
We're in agreement on this part. If lack of the version file is non-fatal, I'm fine with whatever version AMD picks as the default.

I'm just pointing out that you'll likely need a way to keep compiler working even if the filesystem does not have everything you expect for the full end-to-end compilation, be it the version file or bitcode. Reasonable default is fine. Overrides can be implemented later, if necessary. 



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82930/new/

https://reviews.llvm.org/D82930





More information about the cfe-commits mailing list