[llvm] [NVPTX] Skip processing BasicBlocks with single unreachable instruction in `nvptx-lower-unreachable` pass. (PR #72641)

Mon Mar 4 08:49:52 PST 2024

================
@@ -138,7 +138,19 @@ bool NVPTXLowerUnreachable::runOnFunction(Function &F) {
   InlineAsm *Exit = InlineAsm::get(ExitFTy, "exit;", "", true);
 
   bool Changed = false;
-  for (auto &BB : F)
+
+  // In scenarios where a BasicBlock contains only one unreachable instruction,
+  // the joint action of nvptx-isel and unreachable-mbb-elimination
+  // effectively optimizes the BasicBlock out. However, adding an exit
+  // command to such a BasicBlock, as suggested by this pass, preserves it
+  // within the Control Flow Graph (CFG), thereby negatively impacting size and
+  // performance. To counteract this undesirable consequence, we choose to
+  // refrain from processing BasicBlocks with just one unreachable instruction
+  // in this pass.
+
----------------
mmoadeli wrote:

@Artem-B 

[ptx](https://1drv.ms/f/s!Art4q8XrOnV7m2-K7fHC2nBLnlw3?e=ieu7cF)

Running the shared code built with syclos revision 8e921930 (related to [D152789](https://reviews.llvm.org/D152789)), with `--save-temps` added to the last command in ethminier/ethminer/build.sh, produces a `buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-ae5a30.s` file (811816 bytes) and reports the following measurements:

```
Total Execution Time: 61.0493 s
Total Number of Hashes: 9875488768
Overall Hash rate: 161.763 MH/s

```
Manually editing that PTX file into `buildinfo-sycl-nvptx64-nvidia-cuda-sm_50-204989.s`, by moving the `BasicBlock` containing the `exit` added by the original `nvptx-lower-unreachable` pass in [D152789](https://reviews.llvm.org/D152789) back to its original position, and then rebuilding the binary yields the following measurements:

```
Total Execution Time: 61.177 s
Total Number of Hashes: 14647558144
Overall Hash rate: 239.429 MH/s
```
which is a massive hash-rate improvement of about 48%.
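
As a sanity check on the figures above, here is a minimal sketch that recomputes the hash rate and the relative improvement from the reported hash counts and execution times. The function names `hashRateMHs` and `improvementPercent` are made up for this illustration and are not part of ethminer or LLVM.

```cpp
#include <cstdint>

// Hash rate in MH/s, from a total hash count and wall-clock seconds.
// (Hypothetical helper, written only to check the numbers quoted above.)
double hashRateMHs(std::uint64_t hashes, double seconds) {
  return static_cast<double>(hashes) / seconds / 1e6;
}

// Relative improvement of `after` over `before`, in percent.
double improvementPercent(double before, double after) {
  return (after / before - 1.0) * 100.0;
}
```

Calling `hashRateMHs(9875488768ULL, 61.0493)` and `hashRateMHs(14647558144ULL, 61.177)` gives values that agree with the reported 161.763 MH/s and 239.429 MH/s to within rounding, and `improvementPercent` on those two rates lands close to 48%.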

We can't pinpoint why a minor relocation of the added basic block back to its original position in the CFG (block-placement typically moves it to the end) has such a significant impact. A widened divergent region is one plausible cause, which is what the original PR aimed to address.
It's also challenging to modify the pass so that this particular basic block is not affected by later optimisation passes. Such changes might not be straightforward and could introduce complexity that does not align well with the code standards. For instance, it could be achieved with an extra pass that undoes the work done by the block-placement optimisation, which some may not fancy.
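
For contrast with an extra undo pass, the check this PR proposes (skip any block whose only instruction is `unreachable`) is small. The following is a toy model of that idea, not the LLVM implementation: `Op`, `Block`, `isLoneUnreachable` and `lowerUnreachable` are stand-ins invented for illustration, and the real pass works on `llvm::BasicBlock` and inserts an inline-asm `exit;` call.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for IR; these are NOT the LLVM classes.
enum class Op { Other, Call, Unreachable, Exit };

struct Block {
  std::string name;
  std::vector<Op> insts;
};

// True when the block consists solely of a single `unreachable`.
bool isLoneUnreachable(const Block &b) {
  return b.insts.size() == 1 && b.insts.front() == Op::Unreachable;
}

// Inserts an Exit marker before each trailing `unreachable`, except in
// blocks that contain nothing else. Returns the number of blocks changed.
std::size_t lowerUnreachable(std::vector<Block> &fn) {
  std::size_t changed = 0;
  for (Block &b : fn) {
    if (b.insts.empty() || b.insts.back() != Op::Unreachable)
      continue; // block does not end in `unreachable`
    if (isLoneUnreachable(b))
      continue; // left alone so later passes can still drop the block
    b.insts.insert(b.insts.end() - 1, Op::Exit);
    ++changed;
  }
  return changed;
}
```

In this model, a block holding a call followed by `unreachable` gets an `Exit` inserted before the terminator, while a block that is just `unreachable` is left untouched.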

It would be valuable to have your input, and possibly input from @maleadt as well.

Thanks


https://github.com/llvm/llvm-project/pull/72641