<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Clang and LLVM IR on PowerPC, and getting: CUDA failure: ‘Invalid device function’"

   href="https://bugs.llvm.org/show_bug.cgi?id=49983">49983</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Clang and LLVM IR on PowerPC, and getting: CUDA failure: ‘Invalid device function’

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>11.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>Other

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>other

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>CUDA

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>amir.sojoodi@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>I am trying to optimize a CUDA code with LLVM passes on a PowerPC system (RHEL

7.6 with no root access) equipped with V100 GPUs, CUDA 10.1, and LLVM 11 (built

from source). Also, I tested *clang, lli,* and *opt* on a simple C++ code

(without any CUDA calls), and everything works just fine.

I can also successfully compile and run a CUDA source code with `clang++`, but

the problem occurs when I want to do the compilation steps manually (even

without any LLVM passes)

Down here, I managed to compile a simple CUDA source, the usual *axpy*:

```lang-c-like

#include <iostream>

#define cudaCheckError()                                       \

  {                                                            \

    cudaError_t e = cudaGetLastError();                        \

    if (e != cudaSuccess) {                                    \

      printf("Cuda failure %s:%d: '%s'\n", __FILE__, __LINE__, \

             cudaGetErrorString(e));                           \

      exit(EXIT_FAILURE);                                      \

    }                                                          \

  }

__global__ void axpy(float a, float* x, float* y) {

  y[threadIdx.x] = a * x[threadIdx.x];

}

int main(int argc, char* argv[]) {

  const int kDataLen = 4;

  float a = 2.0f;

  float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};

  float host_y[kDataLen];

  // Copy input data to device.

  float* device_x;

  float* device_y;

  cudaMalloc(&device_x, kDataLen * sizeof(float));

  cudaMalloc(&device_y, kDataLen * sizeof(float));

  cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),

             cudaMemcpyHostToDevice);

  // Launch the kernel.

  axpy<<<1, kDataLen>>>(a, device_x, device_y);

  cudaCheckError();

  // Copy output data to host.

  cudaDeviceSynchronize();

  cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),

             cudaMemcpyDeviceToHost);

  // Print the results.

  for (int i = 0; i < kDataLen; ++i) {

    std::cout << "y[" << i << "] = " << host_y[i] << "\n";

  }

  cudaDeviceReset();

  return 0;

}

```

With this Makefile like this:

```lang-mk

BIN_FILE=axpy

SRC_FILE=$(BIN_FILE).cu

main: $(BIN_FILE)

$(BIN_FILE).ll: $(SRC_FILE)

        clang++ -stdlib=libc++ -Wall $(SRC_FILE) --cuda-host-only

--cuda-gpu-arch=sm_70 -S -c -emit-llvm

$(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll: $(SRC_FILE)

        clang++ -stdlib=libc++ -Wall $(SRC_FILE) --cuda-device-only

--cuda-gpu-arch=sm_70 -S -c -emit-llvm

$(BIN_FILE).ptx: $(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll

        llc -march=nvptx64 -mcpu=sm_70

$(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll -o $(BIN_FILE).ptx

$(BIN_FILE)_dlink.o: $(BIN_FILE).ptx

        ptxas -m64 --gpu-name=sm_70 $(BIN_FILE).ptx -o $(BIN_FILE).ptx.o

        fatbinary --64 --create $(BIN_FILE).fatbin

--image=profile=sm_70,file=$(BIN_FILE).ptx.o \

                         --image=profile=compute_70,file=$(BIN_FILE).ptx

        nvcc $(BIN_FILE).fatbin -arch=sm_70 -dlink -o $(BIN_FILE)_dlink.o

-rdc=true

# For the host code:

$(BIN_FILE).o: $(BIN_FILE).ll

        llc -mcpu=ppc64le $(BIN_FILE).ll -o $(BIN_FILE).s

        clang++ -c $(BIN_FILE).s -o $(BIN_FILE).o

# Link both object files together with a linker:

$(BIN_FILE): $(BIN_FILE).o $(BIN_FILE)_dlink.o

        nvcc $(BIN_FILE).o $(BIN_FILE)_dlink.o -o $(BIN_FILE) -arch=sm_70 -lc++

clean:

        rm *.ll *.s *.ptx *.ptx.o *.fatbin $(BIN_FILE) $(BIN_FILE).o

$(BIN_FILE)_dlink.o

```

It seems all the steps run smoothly without any warning, but after running the

generated executable file, I get this error: 

`Cuda failure axpy.cu:33: 'invalid device function'`

I have also replaced the last linker command with the following, and it is

compiled without any problem, but with the same runtime error.

```lang-mk

clang++ -stdlib=libc++ $(BIN_FILE).o $(BIN_FILE)_dlink.o -o $(BIN_FILE) -lcuda

-lcudart -lcudadevrt \

                -L/path-to-gcc-lib/ 

```

The output of `cuobjdump axpy.fatbin -ptx -sass`:

```

Fatbin elf code:

================

arch = sm_70

code version = [1,7]

producer = <unknown>

host = linux

compile_size = 64bit

code for sm_70

                Function : _Z4axpyfPfS_

.headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"

   /*0000*/           MOV R1, c[0x0][0x28] ;                       /*

0x00000a0000017a02 */

                                                                   /*

0x000fd00000000f00 */

   /*0010*/      @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                /*

0x000000fffffff389 */

                                                                   /*

0x000fe200000e00ff */

   /*0020*/           IADD3 R1, R1, -0x18, RZ ;                    /*

0xffffffe801017810 */

                                                                   /*

0x000fe20007ffe0ff */

   /*0030*/           IMAD.MOV.U32 R9, RZ, RZ, c[0x0][0x16c] ;     /*

0x00005b00ff097624 */

                                                                   /*

0x000fe200078e00ff */

   /*0040*/           MOV R8, c[0x0][0x168] ;                      /*

0x00005a0000087a02 */

                                                                   /*

0x000fe40000000f00 */

   /*0050*/           IADD3 R2, P0, R1, c[0x0][0x20], RZ ;         /*

0x0000080001027a10 */

                                                                   /*

0x000fc80007f1e0ff */

   /*0060*/           IADD3.X R3, RZ, c[0x0][0x24], RZ, P0, !PT ;  /*

0x00000900ff037a10 */

                                                                   /*

0x000fd000007fe4ff */

   /*0070*/           ST.E.64.SYS [R2+0x8], R8 ;                   /*

0x0000000802007385 */

                                                                   /*

0x0001e8000010eb08 */

   /*0080*/           LD.E.64.SYS R4, [R2+0x8] ;                   /*

0x0000000802047980 */

                                                                   /*

0x000ea2000010eb00 */

   /*0090*/           IMAD.MOV.U32 R10, RZ, RZ, c[0x0][0x170] ;    /*

0x00005c00ff0a7624 */

                                                                   /*

0x000fe200078e00ff */

   /*00a0*/           MOV R11, c[0x0][0x174] ;                     /*

0x00005d00000b7a02 */

                                                                   /*

0x000fe20000000f00 */

   /*00b0*/           IMAD.MOV.U32 R15, RZ, RZ, c[0x0][0x160] ;    /*

0x00005800ff0f7624 */

                                                                   /*

0x000fe200078e00ff */

   /*00c0*/           S2R R13, SR_TID.X ;                          /*

0x00000000000d7919 */

                                                                   /*

0x000eac0000002100 */

   /*00d0*/           ST.E.64.SYS [R2+0x10], R10 ;                 /*

0x0000001002007385 */

                                                                   /*

0x000fe8000010eb0a */

   /*00e0*/           LD.E.64.SYS R6, [R2+0x10] ;                  /*

0x0000001002067980 */

                                                                   /*

0x000ee8000010eb00 */

   /*00f0*/           ST.E.SYS [R2], R15 ;                         /*

0x0000000002007385 */

                                                                   /*

0x000fe8000010e90f */

   /*0100*/           LD.E.SYS R0, [R2] ;                          /*

0x0000000002007980 */

                                                                   /*

0x000e22000010e900 */

   /*0110*/           IMAD.WIDE.U32 R4, R13, 0x4, R4 ;             /*

0x000000040d047825 */

                                                                   /*

0x004fd400078e0004 */

   /*0120*/           LD.E.SYS R5, [R4] ;                          /*

0x0000000004057980 */

                                                                   /*

0x000e22000010e900 */

   /*0130*/           IMAD.WIDE.U32 R6, R13, 0x4, R6 ;             /*

0x000000040d067825 */

                                                                   /*

0x008fe400078e0006 */

   /*0140*/           FMUL R9, R0, R5 ;                            /*

0x0000000500097220 */

                                                                   /*

0x001fd00000400000 */

   /*0150*/           ST.E.SYS [R6], R9 ;                          /*

0x0000000006007385 */

                                                                   /*

0x000fe2000010e909 */

   /*0160*/           EXIT ;                                       /*

0x000000000000794d */

                                                                   /*

0x000fea0003800000 */

   /*0170*/           BRA 0x170;                                   /*

0xfffffff000007947 */

                                                                   /*

0x000fc0000383ffff */

        .......................

```

The output ends with dots. Also, `nvdisasm axpy` throws this error: `nvdisasm

fatal   : axpy is not a supported Elf file`

I suspect that the problem is originated in the Makefile steps related to the

`fat binary`. Figure 1 in this

[link](<a href="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory">https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory</a>)

includes the fat binary creation steps, which might be useful.</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>