<div dir="ltr">Please check whether r237007 fixes the failure for you.<br><div><br></div><div>--Artem</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, May 11, 2015 at 11:34 AM, Artem Belevich <span dir="ltr"><<a href="mailto:tra@google.com" target="_blank">tra@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">It looks like on s390x strings come with 'align 2'. I'll remove align from the CHECK constraint and that should fix the test failure on s390x. I'll commit the fix shortly.<div><br></div><div>--Artem<br><div class="gmail_extra"><div><div class="h5"><br><div class="gmail_quote">On Mon, May 11, 2015 at 10:43 AM, Artem Belevich <span dir="ltr"><<a href="mailto:tra@google.com" target="_blank">tra@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><span style="font-family:monospace,monospace;font-size:12.8000001907349px">Could you send me output of the CC1 executed by the test before it's piped into FileCheck? </span><br></div><span><div><br></div><span style="font-family:monospace,monospace;font-size:12.8000001907349px">/scratch/hstong/workdir/</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">Release+Asserts/bin/clang -cc1 -internal-isystem /scratch/hstong/workdir/</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">Release+Asserts/bin/../lib/</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">clang/3.7.0/include -nostdsysteminc -emit-llvm /gsa/tlbgsa-h1/08/hstong/pub/</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">cfe_trunk/clang/test/</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">CodeGenCUDA/</span><a href="http://device-stub.cu/" style="font-family:monospace,monospace;font-size:12.8000001907349px" target="_blank">device-stub.cu</a><span style="font-family:monospace,monospace;font-size:12.8000001907349px"> -fcuda-include-gpubinary /gsa/tlbgsa-h1/08/hstong/pub/</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">cfe_trunk/clang/test/</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">CodeGenCUDA/</span><a href="http://device-stub.cu/" style="font-family:monospace,monospace;font-size:12.8000001907349px" target="_blank">device-stub.cu</a><span style="font-family:monospace,monospace;font-size:12.8000001907349px"> -o -</span><br><div><span style="font-family:monospace,monospace;font-size:12.8000001907349px"><br></span></div></span><div><span style="font-family:monospace,monospace;font-size:12.8000001907349px">Oh, and I see a typo in the script -- "</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">CHEKC: call{{.*}}kernelfunc</span><span style="font-family:monospace,monospace;font-size:12.8000001907349px">", though it's probably not what breaks the test in your case.</span></div><div><span style="font-family:monospace,monospace;font-size:12.8000001907349px"><br></span></div><div><span style="font-family:monospace,monospace;font-size:12.8000001907349px">--Artem</span></div></div><div class="gmail_extra"><div><div><br><div class="gmail_quote">On Sat, May 9, 2015 at 11:10 AM, Hubert Tong <span dir="ltr"><<a href="mailto:hubert.reinterpretcast@gmail.com" target="_blank">hubert.reinterpretcast@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div>Hi Artem,<br><br></div>I am encountering a failure with <a href="http://device-stub.cu" target="_blank">device-stub.cu</a> on s390x-suse-linux. Can you take a look?<br><br><b>Output:</b><br><span style="font-family:monospace,monospace">FAIL: Clang :: CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a> (1986 of 21893)<br>******************** TEST 'Clang :: CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a>' FAILED ********************<br>Script:<br>--<br>/scratch/hstong/workdir/Release+Asserts/bin/clang -cc1 -internal-isystem /scratch/hstong/workdir/Release+Asserts/bin/../lib/clang/3.7.0/include -nostdsysteminc -emit-llvm /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a> -fcuda-include-gpubinary /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a> -o - | /scratch/hstong/workdir/Release+Asserts/bin/FileCheck /gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a><br>--<br>Exit Code: 1<br><br>Command Output (stderr):<br>--<br>/gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/device-stub.cu:7:11: error: expected string not found in input<span><br>// CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00", align 1<br></span>          ^<br><stdin>:1:1: note: scanning from here<br>; ModuleID = '/gsa/tlbgsa-h1/08/hstong/pub/cfe_trunk/clang/test/CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a>'<br>^<br><stdin>:13:298: note: possible intended match here<br>@1 = private unnamed_addr constant [2259 x i8] c"// RUN: %clang_cc1 -emit-llvm %s -fcuda-include-gpubinary %s -o - | FileCheck %s\0A\0A#include \22Inputs/cuda.h\22\0A\0A// Make sure that all parts of GPU code init/cleanup are there:\0A// * constant unnamed string with the kernel name\0A// CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\5C00\22, align 1\0A// * constant unnamed string with GPU binary\0A// CHECK: private unnamed_addr constant{{.*}}\5C00\22\0A// * constant struct that wraps GPU binary\0A// CHECK: @__cuda_fatbin_wrapper = internal constant { i32, i32, i8*, i8* } \0A// CHECK: { i32 1180844977, i32 1, {{.*}}, i8* null }\0A// * variable to save GPU binary handle after initialization\0A// CHECK: @__cuda_gpubin_handle = internal global i8** null\0A// * Make sure our constructor/destructor was added to global ctor/dtor list.\0A// CHECK: @llvm.global_ctors = appending global {{.*}}@__cuda_module_ctor\0A// CHECK: @llvm.global_dtors = appending global {{.*}}@__cuda_module_dtor\0A\0A// Test that we build the correct number of calls to cudaSetupArgument followed\0A// by a call to cudaLaunch.\0A\0A// CHECK: define{{.*}}kernelfunc\0A// CHECK: call{{.*}}cudaSetupArgument\0A// CHECK: call{{.*}}cudaSetupArgument\0A// CHECK: call{{.*}}cudaSetupArgument\0A// CHECK: call{{.*}}cudaLaunch\0A__global__ void kernelfunc(int i, int j, int k) {}\0A\0A// Test that we've built correct kernel launch sequence.\0A// CHECK: define{{.*}}hostfunc\0A// CHECK: call{{.*}}cudaConfigureCall\0A// CHEKC: call{{.*}}kernelfunc\0Avoid hostfunc(void) { kernelfunc<<<1, 1>>>(1, 1, 1); }\0A\0A// Test that we've built a function to register kernels\0A// CHECK: define internal void @__cuda_register_kernels\0A// CHECK: call{{.*}}cudaRegisterFunction(i8** %0, {{.*}}kernelfunc\0A\0A// Test that we've built contructor..\0A// CHECK: define internal void @__cuda_module_ctor\0A// .. that calls __cudaRegisterFatBinary(&__cuda_fatbin_wrapper)\0A// CHECK: call{{.*}}cudaRegisterFatBinary{{.*}}__cuda_fatbin_wrapper\0A// .. stores return value in __cuda_gpubin_handle\0A// CHECK-NEXT: store{{.*}}__cuda_gpubin_handle\0A// .. and then calls __cuda_register_kernels\0A// CHECK-NEXT: call void @__cuda_register_kernels\0A\0A// Test that we've created destructor.\0A// CHECK: define internal void @__cuda_module_dtor\0A// CHECK: load{{.*}}__cuda_gpubin_handle\0A// CHECK-NEXT: call void @__cudaUnregisterFatBinary\0A\0A\00", align 2<br>                                                                                                                                                                                                                                                                                                         ^<br><br>--<br><br>********************<br></span><br><b>Build environment info:</b><br><span style="font-family:monospace,monospace">> g++ -v<br>Using built-in specs.<br>COLLECT_GCC=g++<br>COLLECT_LTO_WRAPPER=/usr/lib64/gcc/s390x-suse-linux/4.8/lto-wrapper<br>Target: s390x-suse-linux<br>Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8 --enable-ssp --disable-libssp --disable-plugin --with-bugurl=<a href="http://bugs.opensuse.org/" target="_blank">http://bugs.opensuse.org/</a> --with-pkgversion='SUSE Linux' --disable-libgcj --disable-libmudflap --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --enable-linker-build-id --enable-linux-futex --program-suffix=-4.8 --without-system-libunwind --with-tune=zEC12 --with-arch=z196 --with-long-double-128 --enable-decimal-float --build=s390x-suse-linux --host=s390x-suse-linux<br>Thread model: posix<br>gcc version 4.8.3 20140627 [gcc-4_8-branch revision 212064] (SUSE Linux) <br><br></span></div><span style="font-family:arial,helvetica,sans-serif">Thanks,<br><br><br></span></div><span style="font-family:arial,helvetica,sans-serif">Hubert Tong</span><br></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, May 7, 2015 at 2:34 PM, Artem Belevich <span dir="ltr"><<a href="mailto:tra@google.com" target="_blank">tra@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Author: tra<br>
Date: Thu May  7 14:34:16 2015<br>
New Revision: 236765<br>
<br>
URL: <a href="http://llvm.org/viewvc/llvm-project?rev=236765&view=rev" target="_blank">http://llvm.org/viewvc/llvm-project?rev=236765&view=rev</a><br>
Log:<br>
[cuda] Include GPU binary into host object file and generate init/deinit code.<br>
<br>
- added -fcuda-include-gpubinary option to incorporate results of<br>
  device-side compilation into host-side one.<br>
- generate code to register GPU binaries and associated kernels<br>
  with CUDA runtime and clean-up on exit.<br>
- added test case for init/deinit code generation.<br>
<br>
Differential Revision: <a href="http://reviews.llvm.org/D9507" target="_blank">http://reviews.llvm.org/D9507</a><br>
<br>
Modified:<br>
    cfe/trunk/include/clang/Driver/CC1Options.td<br>
    cfe/trunk/include/clang/Frontend/CodeGenOptions.h<br>
    cfe/trunk/lib/CodeGen/CGCUDANV.cpp<br>
    cfe/trunk/lib/CodeGen/CGCUDARuntime.h<br>
    cfe/trunk/lib/CodeGen/CodeGenFunction.cpp<br>
    cfe/trunk/lib/CodeGen/CodeGenModule.cpp<br>
    cfe/trunk/lib/Frontend/CompilerInvocation.cpp<br>
    cfe/trunk/test/CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a><br>
<br>
Modified: cfe/trunk/include/clang/Driver/CC1Options.td<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Driver/CC1Options.td?rev=236765&r1=236764&r2=236765&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Driver/CC1Options.td?rev=236765&r1=236764&r2=236765&view=diff</a><br>
==============================================================================<br>
--- cfe/trunk/include/clang/Driver/CC1Options.td (original)<br>
+++ cfe/trunk/include/clang/Driver/CC1Options.td Thu May  7 14:34:16 2015<br>
@@ -631,6 +631,8 @@ def fcuda_allow_host_calls_from_host_dev<br>
 def fcuda_disable_target_call_checks : Flag<["-"],<br>
     "fcuda-disable-target-call-checks">,<br>
   HelpText<"Disable all cross-target (host, device, etc.) call checks in CUDA">;<br>
+def fcuda_include_gpubinary : Separate<["-"], "fcuda-include-gpubinary">,<br>
+  HelpText<"Incorporate CUDA device-side binary into host object file.">;<br>
<br>
 } // let Flags = [CC1Option]<br>
<br>
<br>
Modified: cfe/trunk/include/clang/Frontend/CodeGenOptions.h<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Frontend/CodeGenOptions.h?rev=236765&r1=236764&r2=236765&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Frontend/CodeGenOptions.h?rev=236765&r1=236764&r2=236765&view=diff</a><br>
==============================================================================<br>
--- cfe/trunk/include/clang/Frontend/CodeGenOptions.h (original)<br>
+++ cfe/trunk/include/clang/Frontend/CodeGenOptions.h Thu May  7 14:34:16 2015<br>
@@ -163,6 +163,11 @@ public:<br>
   /// Name of the profile file to use as input for -fprofile-instr-use<br>
   std::string InstrProfileInput;<br>
<br>
+  /// A list of file names passed with -fcuda-include-gpubinary options to<br>
+  /// forward to CUDA runtime back-end for incorporating them into host-side<br>
+  /// object file.<br>
+  std::vector<std::string> CudaGpuBinaryFileNames;<br>
+<br>
   /// Regular expression to select optimizations for which we should enable<br>
   /// optimization remarks. Transformation passes whose name matches this<br>
   /// expression (and support this feature), will emit a diagnostic<br>
<br>
Modified: cfe/trunk/lib/CodeGen/CGCUDANV.cpp<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CGCUDANV.cpp?rev=236765&r1=236764&r2=236765&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CGCUDANV.cpp?rev=236765&r1=236764&r2=236765&view=diff</a><br>
==============================================================================<br>
--- cfe/trunk/lib/CodeGen/CGCUDANV.cpp (original)<br>
+++ cfe/trunk/lib/CodeGen/CGCUDANV.cpp Thu May  7 14:34:16 2015<br>
@@ -20,7 +20,6 @@<br>
 #include "llvm/IR/CallSite.h"<br>
 #include "llvm/IR/Constants.h"<br>
 #include "llvm/IR/DerivedTypes.h"<br>
-#include <vector><br>
<br>
 using namespace clang;<br>
 using namespace CodeGen;<br>
@@ -30,29 +29,66 @@ namespace {<br>
 class CGNVCUDARuntime : public CGCUDARuntime {<br>
<br>
 private:<br>
-  llvm::Type *IntTy, *SizeTy;<br>
-  llvm::PointerType *CharPtrTy, *VoidPtrTy;<br>
+  llvm::Type *IntTy, *SizeTy, *VoidTy;<br>
+  llvm::PointerType *CharPtrTy, *VoidPtrTy, *VoidPtrPtrTy;<br>
+<br>
+  /// Convenience reference to LLVM Context<br>
+  llvm::LLVMContext &Context;<br>
+  /// Convenience reference to the current module<br>
+  llvm::Module &TheModule;<br>
+  /// Keeps track of kernel launch stubs emitted in this module<br>
+  llvm::SmallVector<llvm::Function *, 16> EmittedKernels;<br>
+  /// Keeps track of variables containing handles of GPU binaries. Populated by<br>
+  /// ModuleCtorFunction() and used to create corresponding cleanup calls in<br>
+  /// ModuleDtorFunction()<br>
+  llvm::SmallVector<llvm::GlobalVariable *, 16> GpuBinaryHandles;<br>
<br>
   llvm::Constant *getSetupArgumentFn() const;<br>
   llvm::Constant *getLaunchFn() const;<br>
<br>
+  /// Creates a function to register all kernel stubs generated in this module.<br>
+  llvm::Function *makeRegisterKernelsFn();<br>
+<br>
+  /// Helper function that generates a constant string and returns a pointer to<br>
+  /// the start of the string.  The result of this function can be used anywhere<br>
+  /// where the C code specifies const char*.<br>
+  llvm::Constant *makeConstantString(const std::string &Str,<br>
+                                     const std::string &Name = "",<br>
+                                     unsigned Alignment = 0) {<br>
+    llvm::Constant *Zeros[] = {llvm::ConstantInt::get(SizeTy, 0),<br>
+                               llvm::ConstantInt::get(SizeTy, 0)};<br>
+    auto *ConstStr = CGM.GetAddrOfConstantCString(Str, Name.c_str());<br>
+    return llvm::ConstantExpr::getGetElementPtr(ConstStr->getValueType(),<br>
+                                                ConstStr, Zeros);<br>
+ }<br>
+<br>
+  void emitDeviceStubBody(CodeGenFunction &CGF, FunctionArgList &Args);<br>
+<br>
 public:<br>
   CGNVCUDARuntime(CodeGenModule &CGM);<br>
<br>
-  void EmitDeviceStubBody(CodeGenFunction &CGF, FunctionArgList &Args) override;<br>
+  void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args) override;<br>
+  /// Creates module constructor function<br>
+  llvm::Function *makeModuleCtorFunction() override;<br>
+  /// Creates module destructor function<br>
+  llvm::Function *makeModuleDtorFunction() override;<br>
 };<br>
<br>
 }<br>
<br>
-CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM) : CGCUDARuntime(CGM) {<br>
+CGNVCUDARuntime::CGNVCUDARuntime(CodeGenModule &CGM)<br>
+    : CGCUDARuntime(CGM), Context(CGM.getLLVMContext()),<br>
+      TheModule(CGM.getModule()) {<br>
   CodeGen::CodeGenTypes &Types = CGM.getTypes();<br>
   ASTContext &Ctx = CGM.getContext();<br>
<br>
   IntTy = Types.ConvertType(Ctx.IntTy);<br>
   SizeTy = Types.ConvertType(Ctx.getSizeType());<br>
+  VoidTy = llvm::Type::getVoidTy(Context);<br>
<br>
   CharPtrTy = llvm::PointerType::getUnqual(Types.ConvertType(Ctx.CharTy));<br>
   VoidPtrTy = cast<llvm::PointerType>(Types.ConvertType(Ctx.VoidPtrTy));<br>
+  VoidPtrPtrTy = VoidPtrTy->getPointerTo();<br>
 }<br>
<br>
 llvm::Constant *CGNVCUDARuntime::getSetupArgumentFn() const {<br>
@@ -68,14 +104,17 @@ llvm::Constant *CGNVCUDARuntime::getSetu<br>
<br>
 llvm::Constant *CGNVCUDARuntime::getLaunchFn() const {<br>
   // cudaError_t cudaLaunch(char *)<br>
-  std::vector<llvm::Type*> Params;<br>
-  Params.push_back(CharPtrTy);<br>
-  return CGM.CreateRuntimeFunction(llvm::FunctionType::get(IntTy,<br>
-                                                           Params, false),<br>
-                                   "cudaLaunch");<br>
+  return CGM.CreateRuntimeFunction(<br>
+      llvm::FunctionType::get(IntTy, CharPtrTy, false), "cudaLaunch");<br>
+}<br>
+<br>
+void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,<br>
+                                     FunctionArgList &Args) {<br>
+  EmittedKernels.push_back(CGF.CurFn);<br>
+  emitDeviceStubBody(CGF, Args);<br>
 }<br>
<br>
-void CGNVCUDARuntime::EmitDeviceStubBody(CodeGenFunction &CGF,<br>
+void CGNVCUDARuntime::emitDeviceStubBody(CodeGenFunction &CGF,<br>
                                          FunctionArgList &Args) {<br>
   // Build the argument value list and the argument stack struct type.<br>
   SmallVector<llvm::Value *, 16> ArgValues;<br>
@@ -87,8 +126,7 @@ void CGNVCUDARuntime::EmitDeviceStubBody<br>
     assert(isa<llvm::PointerType>(V->getType()) && "Arg type not PointerType");<br>
     ArgTypes.push_back(cast<llvm::PointerType>(V->getType())->getElementType());<br>
   }<br>
-  llvm::StructType *ArgStackTy = llvm::StructType::get(<br>
-      CGF.getLLVMContext(), ArgTypes);<br>
+  llvm::StructType *ArgStackTy = llvm::StructType::get(Context, ArgTypes);<br>
<br>
   llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");<br>
<br>
@@ -120,6 +158,160 @@ void CGNVCUDARuntime::EmitDeviceStubBody<br>
   CGF.EmitBlock(EndBlock);<br>
 }<br>
<br>
+/// Creates internal function to register all kernel stubs generated in this<br>
+/// module with the CUDA runtime.<br>
+/// \code<br>
+/// void __cuda_register_kernels(void** GpuBinaryHandle) {<br>
+///    __cudaRegisterFunction(GpuBinaryHandle,Kernel0,...);<br>
+///    ...<br>
+///    __cudaRegisterFunction(GpuBinaryHandle,KernelM,...);<br>
+/// }<br>
+/// \endcode<br>
+llvm::Function *CGNVCUDARuntime::makeRegisterKernelsFn() {<br>
+  llvm::Function *RegisterKernelsFunc = llvm::Function::Create(<br>
+      llvm::FunctionType::get(VoidTy, VoidPtrPtrTy, false),<br>
+      llvm::GlobalValue::InternalLinkage, "__cuda_register_kernels", &TheModule);<br>
+  llvm::BasicBlock *EntryBB =<br>
+      llvm::BasicBlock::Create(Context, "entry", RegisterKernelsFunc);<br>
+  CGBuilderTy Builder(Context);<br>
+  Builder.SetInsertPoint(EntryBB);<br>
+<br>
+  // void __cudaRegisterFunction(void **, const char *, char *, const char *,<br>
+  //                             int, uint3*, uint3*, dim3*, dim3*, int*)<br>
+  std::vector<llvm::Type *> RegisterFuncParams = {<br>
+      VoidPtrPtrTy, CharPtrTy, CharPtrTy, CharPtrTy, IntTy,<br>
+      VoidPtrTy,    VoidPtrTy, VoidPtrTy, VoidPtrTy, IntTy->getPointerTo()};<br>
+  llvm::Constant *RegisterFunc = CGM.CreateRuntimeFunction(<br>
+      llvm::FunctionType::get(IntTy, RegisterFuncParams, false),<br>
+      "__cudaRegisterFunction");<br>
+<br>
+  // Extract GpuBinaryHandle passed as the first argument passed to<br>
+  // __cuda_register_kernels() and generate __cudaRegisterFunction() call for<br>
+  // each emitted kernel.<br>
+  llvm::Argument &GpuBinaryHandlePtr = *RegisterKernelsFunc->arg_begin();<br>
+  for (llvm::Function *Kernel : EmittedKernels) {<br>
+    llvm::Constant *KernelName = makeConstantString(Kernel->getName());<br>
+    llvm::Constant *NullPtr = llvm::ConstantPointerNull::get(VoidPtrTy);<br>
+    llvm::Value *args[] = {<br>
+        &GpuBinaryHandlePtr, Builder.CreateBitCast(Kernel, VoidPtrTy),<br>
+        KernelName, KernelName, llvm::ConstantInt::get(IntTy, -1), NullPtr,<br>
+        NullPtr, NullPtr, NullPtr,<br>
+        llvm::ConstantPointerNull::get(IntTy->getPointerTo())};<br>
+    Builder.CreateCall(RegisterFunc, args);<br>
+  }<br>
+<br>
+  Builder.CreateRetVoid();<br>
+  return RegisterKernelsFunc;<br>
+}<br>
+<br>
+/// Creates a global constructor function for the module:<br>
+/// \code<br>
+/// void __cuda_module_ctor(void*) {<br>
+///     Handle0 = __cudaRegisterFatBinary(GpuBinaryBlob0);<br>
+///     __cuda_register_kernels(Handle0);<br>
+///     ...<br>
+///     HandleN = __cudaRegisterFatBinary(GpuBinaryBlobN);<br>
+///     __cuda_register_kernels(HandleN);<br>
+/// }<br>
+/// \endcode<br>
+llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {<br>
+  // void __cuda_register_kernels(void* handle);<br>
+  llvm::Function *RegisterKernelsFunc = makeRegisterKernelsFn();<br>
+  // void ** __cudaRegisterFatBinary(void *);<br>
+  llvm::Constant *RegisterFatbinFunc = CGM.CreateRuntimeFunction(<br>
+      llvm::FunctionType::get(VoidPtrPtrTy, VoidPtrTy, false),<br>
+      "__cudaRegisterFatBinary");<br>
+  // struct { int magic, int version, void * gpu_binary, void * dont_care };<br>
+  llvm::StructType *FatbinWrapperTy =<br>
+      llvm::StructType::get(IntTy, IntTy, VoidPtrTy, VoidPtrTy, nullptr);<br>
+<br>
+  llvm::Function *ModuleCtorFunc = llvm::Function::Create(<br>
+      llvm::FunctionType::get(VoidTy, VoidPtrTy, false),<br>
+      llvm::GlobalValue::InternalLinkage, "__cuda_module_ctor", &TheModule);<br>
+  llvm::BasicBlock *CtorEntryBB =<br>
+      llvm::BasicBlock::Create(Context, "entry", ModuleCtorFunc);<br>
+  CGBuilderTy CtorBuilder(Context);<br>
+<br>
+  CtorBuilder.SetInsertPoint(CtorEntryBB);<br>
+<br>
+  // For each GPU binary, register it with the CUDA runtime and store returned<br>
+  // handle in a global variable and save the handle in GpuBinaryHandles vector<br>
+  // to be cleaned up in destructor on exit. Then associate all known kernels<br>
+  // with the GPU binary handle so CUDA runtime can figure out what to call on<br>
+  // the GPU side.<br>
+  for (const std::string &GpuBinaryFileName :<br>
+       CGM.getCodeGenOpts().CudaGpuBinaryFileNames) {<br>
+    llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> GpuBinaryOrErr =<br>
+        llvm::MemoryBuffer::getFileOrSTDIN(GpuBinaryFileName);<br>
+    if (std::error_code EC = GpuBinaryOrErr.getError()) {<br>
+      CGM.getDiags().Report(diag::err_cannot_open_file) << GpuBinaryFileName<br>
+                                                        << EC.message();<br>
+      continue;<br>
+    }<br>
+<br>
+    // Create initialized wrapper structure that points to the loaded GPU binary<br>
+    llvm::Constant *Values[] = {<br>
+        llvm::ConstantInt::get(IntTy, 0x466243b1), // Fatbin wrapper magic.<br>
+        llvm::ConstantInt::get(IntTy, 1),          // Fatbin version.<br>
+        makeConstantString(GpuBinaryOrErr.get()->getBuffer(), "", 16), // Data.<br>
+        llvm::ConstantPointerNull::get(VoidPtrTy)}; // Unused in fatbin v1.<br>
+    llvm::GlobalVariable *FatbinWrapper = new llvm::GlobalVariable(<br>
+        TheModule, FatbinWrapperTy, true, llvm::GlobalValue::InternalLinkage,<br>
+        llvm::ConstantStruct::get(FatbinWrapperTy, Values),<br>
+        "__cuda_fatbin_wrapper");<br>
+<br>
+    // GpuBinaryHandle = __cudaRegisterFatBinary(&FatbinWrapper);<br>
+    llvm::CallInst *RegisterFatbinCall = CtorBuilder.CreateCall(<br>
+        RegisterFatbinFunc,<br>
+        CtorBuilder.CreateBitCast(FatbinWrapper, VoidPtrTy));<br>
+    llvm::GlobalVariable *GpuBinaryHandle = new llvm::GlobalVariable(<br>
+        TheModule, VoidPtrPtrTy, false, llvm::GlobalValue::InternalLinkage,<br>
+        llvm::ConstantPointerNull::get(VoidPtrPtrTy), "__cuda_gpubin_handle");<br>
+    CtorBuilder.CreateStore(RegisterFatbinCall, GpuBinaryHandle, false);<br>
+<br>
+    // Call __cuda_register_kernels(GpuBinaryHandle);<br>
+    CtorBuilder.CreateCall(RegisterKernelsFunc, RegisterFatbinCall);<br>
+<br>
+    // Save GpuBinaryHandle so we can unregister it in destructor.<br>
+    GpuBinaryHandles.push_back(GpuBinaryHandle);<br>
+  }<br>
+<br>
+  CtorBuilder.CreateRetVoid();<br>
+  return ModuleCtorFunc;<br>
+}<br>
+<br>
+/// Creates a global destructor function that unregisters all GPU code blobs<br>
+/// registered by constructor.<br>
+/// \code<br>
+/// void __cuda_module_dtor(void*) {<br>
+///     __cudaUnregisterFatBinary(Handle0);<br>
+///     ...<br>
+///     __cudaUnregisterFatBinary(HandleN);<br>
+/// }<br>
+/// \endcode<br>
+llvm::Function *CGNVCUDARuntime::makeModuleDtorFunction() {<br>
+  // void __cudaUnregisterFatBinary(void ** handle);<br>
+  llvm::Constant *UnregisterFatbinFunc = CGM.CreateRuntimeFunction(<br>
+      llvm::FunctionType::get(VoidTy, VoidPtrPtrTy, false),<br>
+      "__cudaUnregisterFatBinary");<br>
+<br>
+  llvm::Function *ModuleDtorFunc = llvm::Function::Create(<br>
+      llvm::FunctionType::get(VoidTy, VoidPtrTy, false),<br>
+      llvm::GlobalValue::InternalLinkage, "__cuda_module_dtor", &TheModule);<br>
+  llvm::BasicBlock *DtorEntryBB =<br>
+      llvm::BasicBlock::Create(Context, "entry", ModuleDtorFunc);<br>
+  CGBuilderTy DtorBuilder(Context);<br>
+  DtorBuilder.SetInsertPoint(DtorEntryBB);<br>
+<br>
+  for (llvm::GlobalVariable *GpuBinaryHandle : GpuBinaryHandles) {<br>
+    DtorBuilder.CreateCall(UnregisterFatbinFunc,<br>
+                           DtorBuilder.CreateLoad(GpuBinaryHandle, false));<br>
+  }<br>
+<br>
+  DtorBuilder.CreateRetVoid();<br>
+  return ModuleDtorFunc;<br>
+}<br>
+<br>
 CGCUDARuntime *CodeGen::CreateNVCUDARuntime(CodeGenModule &CGM) {<br>
   return new CGNVCUDARuntime(CGM);<br>
 }<br>
<br>
Modified: cfe/trunk/lib/CodeGen/CGCUDARuntime.h<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CGCUDARuntime.h?rev=236765&r1=236764&r2=236765&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CGCUDARuntime.h?rev=236765&r1=236764&r2=236765&view=diff</a><br>
==============================================================================<br>
--- cfe/trunk/lib/CodeGen/CGCUDARuntime.h (original)<br>
+++ cfe/trunk/lib/CodeGen/CGCUDARuntime.h Thu May  7 14:34:16 2015<br>
@@ -16,6 +16,10 @@<br>
 #ifndef LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H<br>
 #define LLVM_CLANG_LIB_CODEGEN_CGCUDARUNTIME_H<br>
<br>
+namespace llvm {<br>
+class Function;<br>
+}<br>
+<br>
 namespace clang {<br>
<br>
 class CUDAKernelCallExpr;<br>
@@ -39,10 +43,17 @@ public:<br>
   virtual RValue EmitCUDAKernelCallExpr(CodeGenFunction &CGF,<br>
                                         const CUDAKernelCallExpr *E,<br>
                                         ReturnValueSlot ReturnValue);<br>
-<br>
-  virtual void EmitDeviceStubBody(CodeGenFunction &CGF,<br>
-                                  FunctionArgList &Args) = 0;<br>
<br>
+  /// Emits a kernel launch stub.<br>
+  virtual void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args) = 0;<br>
+<br>
+  /// Constructs and returns a module initialization function or nullptr if it's<br>
+  /// not needed. Must be called after all kernels have been emitted.<br>
+  virtual llvm::Function *makeModuleCtorFunction() = 0;<br>
+<br>
+  /// Returns a module cleanup function or nullptr if it's not needed.<br>
+  /// Must be called after ModuleCtorFunction<br>
+  virtual llvm::Function *makeModuleDtorFunction() = 0;<br>
 };<br>
<br>
 /// Creates an instance of a CUDA runtime class.<br>
<br>
Modified: cfe/trunk/lib/CodeGen/CodeGenFunction.cpp<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CodeGenFunction.cpp?rev=236765&r1=236764&r2=236765&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CodeGenFunction.cpp?rev=236765&r1=236764&r2=236765&view=diff</a><br>
==============================================================================<br>
--- cfe/trunk/lib/CodeGen/CodeGenFunction.cpp (original)<br>
+++ cfe/trunk/lib/CodeGen/CodeGenFunction.cpp Thu May  7 14:34:16 2015<br>
@@ -878,7 +878,7 @@ void CodeGenFunction::GenerateCode(Globa<br>
   else if (getLangOpts().CUDA &&<br>
            !getLangOpts().CUDAIsDevice &&<br>
            FD->hasAttr<CUDAGlobalAttr>())<br>
-    CGM.getCUDARuntime().EmitDeviceStubBody(*this, Args);<br>
+    CGM.getCUDARuntime().emitDeviceStub(*this, Args);<br>
   else if (isa<CXXConversionDecl>(FD) &&<br>
            cast<CXXConversionDecl>(FD)->isLambdaToBlockPointerConversion()) {<br>
     // The lambda conversion to block pointer is special; the semantics can't be<br>
<br>
Modified: cfe/trunk/lib/CodeGen/CodeGenModule.cpp<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CodeGenModule.cpp?rev=236765&r1=236764&r2=236765&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/CodeGenModule.cpp?rev=236765&r1=236764&r2=236765&view=diff</a><br>
==============================================================================<br>
--- cfe/trunk/lib/CodeGen/CodeGenModule.cpp (original)<br>
+++ cfe/trunk/lib/CodeGen/CodeGenModule.cpp Thu May  7 14:34:16 2015<br>
@@ -350,6 +350,13 @@ void CodeGenModule::Release() {<br>
   if (ObjCRuntime)<br>
     if (llvm::Function *ObjCInitFunction = ObjCRuntime->ModuleInitFunction())<br>
       AddGlobalCtor(ObjCInitFunction);<br>
+  if (Context.getLangOpts().CUDA && !Context.getLangOpts().CUDAIsDevice &&<br>
+      CUDARuntime) {<br>
+    if (llvm::Function *CudaCtorFunction = CUDARuntime->makeModuleCtorFunction())<br>
+      AddGlobalCtor(CudaCtorFunction);<br>
+    if (llvm::Function *CudaDtorFunction = CUDARuntime->makeModuleDtorFunction())<br>
+      AddGlobalDtor(CudaDtorFunction);<br>
+  }<br>
   if (PGOReader && PGOStats.hasDiagnostics())<br>
     PGOStats.reportDiagnostics(getDiags(), getCodeGenOpts().MainFileName);<br>
   EmitCtorList(GlobalCtors, "llvm.global_ctors");<br>
@@ -3678,4 +3685,3 @@ void CodeGenModule::EmitOMPThreadPrivate<br>
       CXXGlobalInits.push_back(InitFunction);<br>
   }<br>
 }<br>
-<br>
<br>
Modified: cfe/trunk/lib/Frontend/CompilerInvocation.cpp<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInvocation.cpp?rev=236765&r1=236764&r2=236765&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInvocation.cpp?rev=236765&r1=236764&r2=236765&view=diff</a><br>
==============================================================================<br>
--- cfe/trunk/lib/Frontend/CompilerInvocation.cpp (original)<br>
+++ cfe/trunk/lib/Frontend/CompilerInvocation.cpp Thu May  7 14:34:16 2015<br>
@@ -651,6 +651,9 @@ static bool ParseCodeGenArgs(CodeGenOpti<br>
                       Args.getAllArgValues(OPT_fsanitize_recover_EQ), Diags,<br>
                       Opts.SanitizeRecover);<br>
<br>
+  Opts.CudaGpuBinaryFileNames =<br>
+      Args.getAllArgValues(OPT_fcuda_include_gpubinary);<br>
+<br>
   return Success;<br>
 }<br>
<br>
<br>
Modified: cfe/trunk/test/CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a><br>
URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/test/CodeGenCUDA/device-stub.cu?rev=236765&r1=236764&r2=236765&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/test/CodeGenCUDA/device-stub.cu?rev=236765&r1=236764&r2=236765&view=diff</a><br>
==============================================================================<br>
--- cfe/trunk/test/CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a> (original)<br>
+++ cfe/trunk/test/CodeGenCUDA/<a href="http://device-stub.cu" target="_blank">device-stub.cu</a> Thu May  7 14:34:16 2015<br>
@@ -1,7 +1,21 @@<br>
-// RUN: %clang_cc1 -emit-llvm %s -o - | FileCheck %s<br>
+// RUN: %clang_cc1 -emit-llvm %s -fcuda-include-gpubinary %s -o - | FileCheck %s<br>
<br>
 #include "Inputs/cuda.h"<br>
<br>
+// Make sure that all parts of GPU code init/cleanup are there:<br>
+// * constant unnamed string with the kernel name<br>
+// CHECK: private unnamed_addr constant{{.*}}kernelfunc{{.*}}\00", align 1<br>
+// * constant unnamed string with GPU binary<br>
+// CHECK: private unnamed_addr constant{{.*}}\00"<br>
+// * constant struct that wraps GPU binary<br>
+// CHECK: @__cuda_fatbin_wrapper = internal constant { i32, i32, i8*, i8* }<br>
+// CHECK:       { i32 1180844977, i32 1, {{.*}}, i64 0, i64 0), i8* null }<br>
+// * variable to save GPU binary handle after initialization<br>
+// CHECK: @__cuda_gpubin_handle = internal global i8** null<br>
+// * Make sure our constructor/destructor was added to global ctor/dtor list.<br>
+// CHECK: @llvm.global_ctors = appending global {{.*}}@__cuda_module_ctor<br>
+// CHECK: @llvm.global_dtors = appending global {{.*}}@__cuda_module_dtor<br>
+<br>
 // Test that we build the correct number of calls to cudaSetupArgument followed<br>
 // by a call to cudaLaunch.<br>
<br>
@@ -11,3 +25,28 @@<br>
 // CHECK: call{{.*}}cudaSetupArgument<br>
 // CHECK: call{{.*}}cudaLaunch<br>
 __global__ void kernelfunc(int i, int j, int k) {}<br>
+<br>
+// Test that we've built correct kernel launch sequence.<br>
+// CHECK: define{{.*}}hostfunc<br>
+// CHECK: call{{.*}}cudaConfigureCall<br>
+// CHEKC: call{{.*}}kernelfunc<br>
+void hostfunc(void) { kernelfunc<<<1, 1>>>(1, 1, 1); }<br>
+<br>
+// Test that we've built a function to register kernels<br>
+// CHECK: define internal void @__cuda_register_kernels<br>
+// CHECK: call{{.*}}cudaRegisterFunction(i8** %0, {{.*}}kernelfunc<br>
+<br>
+// Test that we've built contructor..<br>
+// CHECK: define internal void @__cuda_module_ctor<br>
+//   .. that calls __cudaRegisterFatBinary(&__cuda_fatbin_wrapper)<br>
+// CHECK: call{{.*}}cudaRegisterFatBinary{{.*}}__cuda_fatbin_wrapper<br>
+//   .. stores return value in __cuda_gpubin_handle<br>
+// CHECK-NEXT: store{{.*}}__cuda_gpubin_handle<br>
+//   .. and then calls __cuda_register_kernels<br>
+// CHECK-NEXT: call void @__cuda_register_kernels<br>
+<br>
+// Test that we've created destructor.<br>
+// CHECK: define internal void @__cuda_module_dtor<br>
+// CHECK: load{{.*}}__cuda_gpubin_handle<br>
+// CHECK-NEXT: call void @__cudaUnregisterFatBinary<br>
+<br>
<br>
<br>
_______________________________________________<br>
cfe-commits mailing list<br>
<a href="mailto:cfe-commits@cs.uiuc.edu" target="_blank">cfe-commits@cs.uiuc.edu</a><br>
<a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits</a><br>
</blockquote></div><br></div>
</div></div></blockquote></div><br><br clear="all"><div><br></div></div></div><span><font color="#888888">-- <br><div><div dir="ltr">--Artem Belevich</div></div>
</font></span></div>
</blockquote></div><br><br clear="all"><div><br></div></div></div><span class="HOEnZb"><font color="#888888">-- <br><div><div dir="ltr">--Artem Belevich</div></div>
</font></span></div></div></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr">--Artem Belevich</div></div>
</div>