[Openmp-dev] CUDA error is: invalid device ordinal
Ye Luo via Openmp-dev openmp-dev at lists.llvm.org
Sun Jun 28 08:23:08 PDT 2020

No need to bother OLCF support. I resolved the problem while
investigating another issue.
The cause was in libomptarget, not in the CUDA installation; see
https://reviews.llvm.org/D82718
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory
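
For anyone triaging a similar failure, here is a minimal sketch that pulls the first failing CUDA call and its error string out of a libomptarget debug log like the one quoted further down. The function name `first_cuda_rtl_error` is hypothetical, and the two message patterns are assumptions copied from that quoted output; other libomptarget versions may word these lines differently.

```python
import re


def first_cuda_rtl_error(log_text):
    """Return (failing_call, error_message) for the first CUDA RTL error, or None.

    Assumes the two-line pattern seen in the quoted log:
      Target CUDA RTL --> Error returned from <call>
      Target CUDA RTL --> CUDA error is: <message>
    """
    failing_call = None
    for raw in log_text.splitlines():
        # Drop mail quote markers ("> ") so a quoted log can be pasted in directly.
        line = re.sub(r"^(>\s?)+", "", raw).rstrip()
        m = re.search(r"Target CUDA RTL --> Error returned from (\w+)", line)
        if m:
            failing_call = m.group(1)
            continue
        m = re.search(r"Target CUDA RTL --> CUDA error is: (.+)", line)
        if m and failing_call is not None:
            return failing_call, m.group(1).strip()
    return None


# Excerpt taken from the debug output quoted in this thread.
sample = """\
> Target CUDA RTL --> Getting device 0
> Target CUDA RTL --> Error returned from cuCtxCreate
> Target CUDA RTL --> CUDA error is: invalid device ordinal
> Libomptarget --> Failed to init device 0
"""

print(first_cuda_rtl_error(sample))
```

On the excerpt above this reports `cuCtxCreate` together with "invalid device ordinal", which matches the observation in the original report that the plugin found device 0 but could not create a context on it.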
On Tue, Jun 9, 2020 at 6:45 PM Johannes Doerfert <johannesdoerfert at gmail.com>
wrote:
> Sounds "good". So we "just" need to figure out which of these it is ;)
>
> @Ye, you'll talk to the Summit admins, correct?
>
>
> On 6/9/20 9:43 AM, Alexey Bataev wrote:
>
> Yes, the error appears when the library tries to get the configuration of the GPU device. There may be a race condition, but in most cases this error just means that something is wrong with CUDA, the device, or both.
>
> Best regards,
> Alexey Bataev
>
> On June 9, 2020, at 10:40, Ye Luo <xw111luoye at gmail.com> wrote:
>
> 
> @Johannes This is not related to the race I mentioned to you yesterday. The error shows up at the very beginning of the execution in serial and only appears on Summit.
> I can run the full code on x86_64 with CUDA 11 installed. On x86_64 with CUDA < 11, the code passes that point but stops in another place.
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> On Tue, Jun 9, 2020 at 9:33 AM Johannes Doerfert <johannesdoerfert at gmail.com> wrote:
>
> @Alexey Why do you think it is a CUDA error and not a race in libomptarget?
>
> @Ye Can we run this on a different system too?
>
>
>
> On 6/9/20 8:19 AM, Ye Luo via Openmp-dev wrote:
>
> It is on the Summit supercomputer. I will ask the administrators for help.
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> On Tue, Jun 9, 2020 at 6:02 AM Alexey.Bataev <a.bataev at outlook.com> wrote:
>
>
>
> Hi, most probably there is something wrong with the CUDA installation or the
> GPU configuration. Try reinstalling CUDA first.
>
> -------------
> Best regards,
> Alexey Bataev
>
> On June 8, 2020 at 10:50 PM, Ye Luo via Openmp-dev wrote:
>
> Hi all,
> I hope to get some insight from the wider community.
> My application runs fine on x86-64 + CUDA.
> When I built the same version of clang and the application on Power9+V100, I
> got "CUDA error is: invalid device ordinal". It seems that the CUDA plugin
> got device 0 but failed to create a context. I have pasted the debug + nvprof
> output at the end of this email.
> A small test program built with the same compiler runs fine.
> What could be causing this CUDA error?
> Ye
>
> Libomptarget --> Call to omp_get_num_devices returning 1
> Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
> Libomptarget --> Entering data begin region for device -1 with 1 mappings
> Libomptarget --> Use default device id 0
> Libomptarget --> Checking whether device 0 is ready.
> Libomptarget --> Is the device 0 (local ID 0) initialized? 0
> Target CUDA RTL --> Init requires flags to 1
> Target CUDA RTL --> Getting device 0
> Target CUDA RTL --> Error returned from cuCtxCreate
> Target CUDA RTL --> CUDA error is: invalid device ordinal
> Libomptarget --> Failed to init device 0
> Libomptarget --> Device 0 is not ready.
> Libomptarget --> Failed to get device 0 ready
> Libomptarget fatal error 1: failure of target construct while offloading is mandatory
> ==176195== Profiling application: ../../../../bin/qmcpack qmc_short_vmcbatch.in.xml
> Libomptarget --> Unloading target library!
> Libomptarget --> Image 0x00000000107b6470 is compatible with RTL 0x000000003b329020!
> Libomptarget --> Unregistered image 0x00000000107b6470 from RTL 0x000000003b329020!
> Libomptarget --> Done unregistering images!
> Libomptarget --> Removing translation table for descriptor 0x0000000010900318
> Libomptarget --> Done unregistering library!
> Libomptarget --> Deinit target library!
> ==176195== Profiling result:
> No kernels were profiled.
>             Type  Time(%)      Time     Calls       Avg       Min       Max  Name
>       API calls:   87.10%  1.75034s         7  250.05ms  250.00ms  250.28ms  cudaFree
>                    12.02%  241.59ms         1  241.59ms  241.59ms  241.59ms  cuDevicePrimaryCtxRelease
>                     0.42%  8.4971ms         1  8.4971ms  8.4971ms  8.4971ms  cuCtxCreate
>                     0.31%  6.1826ms         3  2.0609ms  827.87us  3.7271ms  cuModuleUnload
>                     0.08%  1.5932ms        97  16.424us     241ns  652.53us  cuDeviceGetAttribute
>                     0.05%  1.0525ms         1  1.0525ms  1.0525ms  1.0525ms  cuDeviceTotalMem
>                     0.01%  209.36us         1  209.36us  209.36us  209.36us  cuDeviceGetName
>                     0.00%  73.862us         7  10.551us  4.6310us  28.909us  cudaSetDevice
>                     0.00%  4.3990us         3  1.4660us     543ns  2.6840us  cuDeviceGet
>                     0.00%  3.9920us         1  3.9920us  3.9920us  3.9920us  cuDeviceGetPCIBusId
>                     0.00%  3.0740us         1  3.0740us  3.0740us  3.0740us  cudaGetDeviceCount
>                     0.00%  3.0000us         4     750ns     407ns  1.2090us  cuDeviceGetCount
>                     0.00%  2.1410us         1  2.1410us  2.1410us  2.1410us  cuInit
>                     0.00%  2.1080us         1  2.1080us  2.1080us  2.1080us  cuDriverGetVersion
>                     0.00%  1.9570us         1  1.9570us  1.9570us  1.9570us  cuGetErrorString
>                     0.00%  1.2870us         1  1.2870us  1.2870us  1.2870us  cuCtxSetCurrent
>                     0.00%     393ns         1     393ns     393ns     393ns  cuDeviceGetUuid
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> Openmp-dev mailing list
> Openmp-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev