[PATCH] D11664: [CUDA] Implemented additional processing steps needed to link with CUDA libdevice bitcode.

Mon Aug 24 14:46:58 PDT 2015

echristo added inline comments.

================
Comment at: lib/CodeGen/CodeGenAction.cpp:181-190
@@ -166,2 +180,12 @@
           return;
+        if (LangOpts.CUDA && LangOpts.CUDAIsDevice &&
+            LangOpts.CUDAUsesLibDevice) {
+          legacy::PassManager passes;
+          passes.add(createInternalizePass(ModuleFuncNames));
+          // Considering that most of the functions we've linked are
+          // not going to be used, we may want to eliminate them
+          // early.
+          passes.add(createGlobalDCEPass());
+          passes.run(*TheModule);
+        }
       }
----------------
tra wrote:
> echristo wrote:
> > Seems like this should be part of the normal IPO pass run? This seems like an odd place to put this, can you explain why a bit more?
> It will indeed happen during normal optimization, but, as the NVPTX docs say, it makes a fair amount of sense to eliminate early the large amount of bitcode that we know we're not going to need. libdevice carries ~450 functions and only a handful of those are needed. Why run all the other optimization passes on them?
> 
> In addition to that, we need to pass Internalize the list of symbols to preserve. As far as I can tell, the way to do that within the normal optimization pipeline is to pass them to the back-end via -internalize-public-api-list/-internalize-public-api-file. That's not a particularly suitable way to carry the potentially large list of symbols we will find in the TU we're dealing with.
> 
> I could move GDCE to LLVM where it would arguably be somewhat more effective if done after NVVMReflect, but keeping it next to internalize makes it easier to see that we intentionally internalize and eliminate unused bitcode here.
I might not have been clear. I'm curious why all of this isn't just part of the normal IPO pass run that should be happening on the code anyhow. Taking a step back: this should just go through the normal "let's set up a pipeline for the code" path, which might end up being CUDA-specific, but should be handled in the same way.

Does that make sense?
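
For readers following the thread, here is a rough toy model of what the internalize + GlobalDCE pair in the hunk above is meant to achieve. It is only a sketch and does not use LLVM's API: ToyFunction, ToyModule, internalize(), globalDCE() and makeLinkedModule() are hypothetical names invented for illustration, and __nv_helper is a made-up stand-in for an internal libdevice helper. The assumption is that the preserve list (ModuleFuncNames in the patch) holds the names defined by the TU before libdevice was linked in.

```cpp
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are not LLVM classes.
struct ToyFunction {
  bool External;                     // true if the symbol is externally visible
  std::vector<std::string> Callees;  // names of functions this one references
};
using ToyModule = std::map<std::string, ToyFunction>;

// Internalize: every function not named in Preserve loses external linkage.
void internalize(ToyModule &M, const std::set<std::string> &Preserve) {
  for (auto &Entry : M)
    if (!Preserve.count(Entry.first))
      Entry.second.External = false;
}

// Global DCE: keep only what is reachable from externally visible functions.
void globalDCE(ToyModule &M) {
  std::set<std::string> Live;
  std::vector<std::string> Worklist;
  for (const auto &Entry : M)
    if (Entry.second.External)
      Worklist.push_back(Entry.first);
  while (!Worklist.empty()) {
    std::string Name = Worklist.back();
    Worklist.pop_back();
    if (!Live.insert(Name).second)
      continue;  // already visited
    auto It = M.find(Name);
    if (It == M.end())
      continue;  // declaration only, nothing to follow
    for (const auto &Callee : It->second.Callees)
      Worklist.push_back(Callee);
  }
  // Erase every function that was never reached.
  for (auto It = M.begin(); It != M.end();)
    It = Live.count(It->first) ? std::next(It) : M.erase(It);
}

// A TU with one kernel, linked against a tiny pretend libdevice.
ToyModule makeLinkedModule() {
  ToyModule M;
  M["kernel"] = {true, {"__nv_sinf"}};
  M["__nv_sinf"] = {true, {"__nv_helper"}};
  M["__nv_helper"] = {true, {}};  // hypothetical helper, not a real symbol
  M["__nv_cosf"] = {true, {}};
  M["__nv_expf"] = {true, {}};
  return M;
}
```

Calling internalize(M, {"kernel"}) followed by globalDCE(M) on the module from makeLinkedModule() leaves kernel, __nv_sinf and __nv_helper, and drops the two unused functions before any further passes would see them. Without the internalize step every libdevice function stays externally visible, so the DCE step in this model has nothing it can remove.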


http://reviews.llvm.org/D11664