[PATCH] D11664: [CUDA] Implemented additional processing steps needed to link with CUDA libdevice bitcode.
Artem Belevich via cfe-commits
cfe-commits at lists.llvm.org
Mon Aug 24 14:10:06 PDT 2015
tra added inline comments.
================
Comment at: lib/CodeGen/CodeGenAction.cpp:166-170
@@ +165,7 @@
+ std::vector<const char *> ModuleFuncNames;
+ // We need to internalize contents of the linked module but it
+ // has to be done *after* the linking because internalized
+ // symbols will not be linked in otherwise.
+ // In order to do that, we collect the current list of function names
+ // in the module and later pass it to the Internalize pass as the list
+ // of symbols to preserve.
+ if (LangOpts.CUDA && LangOpts.CUDAIsDevice &&
----------------
echristo wrote:
> Can you explain this in a different way perhaps? I'm not sure what you mean here.
This patch implements the following items from llvm.org/docs/NVPTXUsage.html:
> The internalize pass is also recommended to remove unused math functions from the resulting PTX. For an input IR module module.bc, the following compilation flow is recommended:
>
> 1. Save list of external functions in module.bc
> 2. Link module.bc with libdevice.compute_XX.YY.bc
> 3. Internalize all functions not in list from (1)
> 4. Eliminate all unused internal functions
The LLVM part of the changes takes care of NVVMReflect:
> * Run NVVMReflect pass
> * Run standard optimization pipeline
================
Comment at: lib/CodeGen/CodeGenAction.cpp:181-190
@@ -166,2 +180,12 @@
return;
+ if (LangOpts.CUDA && LangOpts.CUDAIsDevice &&
+ LangOpts.CUDAUsesLibDevice) {
+ legacy::PassManager passes;
+ passes.add(createInternalizePass(ModuleFuncNames));
+ // Considering that most of the functions we've linked are
+ // not going to be used, we may want to eliminate them
+ // early.
+ passes.add(createGlobalDCEPass());
+ passes.run(*TheModule);
+ }
}
----------------
echristo wrote:
> Seems like this should be part of the normal IPO pass run? This seems like an odd place to put this, can you explain why a bit more?
It will indeed happen during normal optimization, but as the NVPTX docs say, it makes a fair amount of sense to eliminate quite a bit of bitcode that we know we're not going to need. libdevice carries ~450 functions and only a handful of those are needed. Why run all the other optimization passes on them?
In addition to that, we need to pass Internalize the list of symbols to preserve. As far as I can tell, the way to do that within the normal optimization pipeline is to pass them to the back-end via -internalize-public-api-list/-internalize-public-api-file. That's not a particularly suitable way to carry the potentially large list of symbols we will find in the TU we're dealing with.
I could move GDCE to LLVM, where it would arguably be somewhat more effective if done after NVVMReflect, but keeping it next to Internalize makes it easier to see that we intentionally internalize and eliminate unused bitcode here.
http://reviews.llvm.org/D11664