[Openmp-commits] [PATCH] D106774: [libomptarget][amdgpu] More robust handling of failure to init HSA

Jon Chesterfield via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Sun Jul 25 13:49:01 PDT 2021


JonChesterfield created this revision.
JonChesterfield added reviewers: jdoerfert, tianshilei1992, ronlieb, pdhaliwal, grokos, ye-luo.
Herald added subscribers: kerbowa, t-tye, tpr, dstuttard, yaxunl, nhaehnle, jvesely, kzhuravl.
JonChesterfield requested review of this revision.
Herald added subscribers: openmp-commits, wdng.
Herald added a project: OpenMP.

If hsa_init fails, subsequent calls into hsa are not safe. Except for
hsa_init, but we don't retry on failure.

This patch:

- deletes a print that called into hsa to ask why it can't call into hsa
- drops a merge conflict block next to that print
- reliably initializes number of devices to zero
- skips the plugin destructor contents if the constructor failed to init hsa

Tested by making hsa_init return error, and by forcing the dynamic library
use which was then deleted from disk. Before this patch, both segv. After it,
friendly message about offloading being unavailable.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D106774

Files:
  openmp/libomptarget/plugins/amdgpu/impl/system.cpp
  openmp/libomptarget/plugins/amdgpu/src/rtl.cpp


Index: openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
===================================================================
--- openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
+++ openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
@@ -436,13 +436,14 @@
 /// Class containing all the device information
 class RTLDeviceInfoTy {
   std::vector<std::list<FuncOrGblEntryTy>> FuncGblEntries;
+  bool HSAInitializeSucceeded = false;
 
 public:
   // load binary populates symbol tables and mutates various global state
   // run uses those symbol tables
   std::shared_timed_mutex load_run_lock;
 
-  int NumberOfDevices;
+  int NumberOfDevices = 0;
 
   // GPU devices
   std::vector<hsa_agent_t> HSAAgents;
@@ -688,7 +689,9 @@
 
     DP("Start initializing HSA-ATMI\n");
     hsa_status_t err = core::atl_init_gpu_context();
-    if (err != HSA_STATUS_SUCCESS) {
+    if (err == HSA_STATUS_SUCCESS) {
+      HSAInitializeSucceeded = true;
+    } else {
       DP("Error when initializing HSA-ATMI\n");
       return;
     }
@@ -791,6 +794,10 @@
 
   ~RTLDeviceInfoTy() {
     DP("Finalizing the HSA-ATMI DeviceInfo.\n");
+    if (!HSAInitializeSucceeded) {
+      // Then none of these can have been set up and they can't be torn down
+      return;
+    }
     // Run destructors on types that use HSA before
     // atmi_finalize removes access to it
     deviceStateStore.clear();
@@ -969,6 +976,9 @@
 
   // this is per device id init
   DP("Initialize the device id: %d\n", device_id);
+  if ((device_id < 0) || (device_id >= DeviceInfo.HSAAgents.size())) {
+    return OFFLOAD_FAIL;
+  }
 
   hsa_agent_t agent = DeviceInfo.HSAAgents[device_id];
 
Index: openmp/libomptarget/plugins/amdgpu/impl/system.cpp
===================================================================
--- openmp/libomptarget/plugins/amdgpu/impl/system.cpp
+++ openmp/libomptarget/plugins/amdgpu/impl/system.cpp
@@ -356,12 +356,8 @@
   DEBUG_PRINT("Initializing HSA...");
   hsa_status_t err = hsa_init();
   if (err != HSA_STATUS_SUCCESS) {
-    printf("[%s:%d] %s failed: %s\n", __FILE__, __LINE__,
-           "Initializing the hsa runtime", get_error_string(err));
     return err;
   }
-  if (err != HSA_STATUS_SUCCESS)
-    return err;
 
   err = init_compute_and_memory();
   if (err != HSA_STATUS_SUCCESS)


-------------- next part --------------
A non-text attachment was scrubbed...
Name: D106774.361533.patch
Type: text/x-patch
Size: 2281 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/openmp-commits/attachments/20210725/2cc3c3f1/attachment.bin>


More information about the Openmp-commits mailing list