[clang] [llvm] [NVPTX] Consolidate and cleanup various NVPTXISD nodes (NFC) (PR #145581)

Thu Aug 14 17:58:28 PDT 2025

ThomasRaoux wrote:

@AlexMaclean I compared the runs in ncu. There are no differences in occupancy, and the arithmetic usage is roughly the same, but I see some large stalls on an `SR_CgaCtaId` read in a loop, which comes from the extra global_smem copy:
<img width="3026" height="900" alt="image" src="https://github.com/user-attachments/assets/33962f55-1132-49cf-8442-33970055647c" />

This seems to be the main reason for the significant slowdown here. It looks like a legitimate problem in what ptxas generates, and I don't think it can be worked around from the user's side.
Can we go back to doing CSE for the global_smem move, as this seems to help code quality?

<img width="1887" height="450" alt="image" src="https://github.com/user-attachments/assets/dabc5330-eb75-4dad-a13b-ce4b59f08ac0" />

<img width="1887" height="450" alt="image" src="https://github.com/user-attachments/assets/fcb4dc15-9883-4a56-a8d3-b689bb057cfb" />

https://github.com/llvm/llvm-project/pull/145581