[llvm] [DAGCombiner] Lower dynamic insertelt chain more efficiently (PR #162368)

Tue Oct 14 11:09:49 PDT 2025

================
@@ -0,0 +1,360 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mcpu=sm_20 | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -mcpu=sm_20 | %ptxas-verify %}
+target triple = "nvptx64-nvidia-cuda"
+
+; COM: Save the vector to the stack once.
+define ptx_kernel void @lower_once(ptr addrspace(3) %shared.mem, <8 x double> %vector, i32 %idx0, i32 %idx1, i32 %idx2, i32 %idx3) local_unnamed_addr {
----------------
Artem-B wrote:

Those are fairly large test cases. Can they be further reduced? 

- Do we need to operate on double? Using a smaller type may reduce the number of loads/stores needed.
- Can we use literal element values? Then we would not need to generate load instructions.
- In some cases we can reduce the vector size. We do not need an 8-element vector for a test that modifies just one element.
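
To illustrate the kind of reduction meant above, here is a rough, untested sketch rather than a proposed replacement. The function name `@lower_once_reduced`, the `<4 x i32>` type, the literal values, and the final store are all assumptions chosen for illustration. The new lowering may need a different chain length or element type to trigger, so the CHECK lines would have to be regenerated with utils/update_llc_test_checks.py.

```llvm
; Hypothetical reduced variant: i32 elements instead of double, a 4-element
; vector instead of 8, and literal inserted values so no loads are needed.
define ptx_kernel void @lower_once_reduced(ptr addrspace(3) %shared.mem, <4 x i32> %vector, i32 %idx0, i32 %idx1) local_unnamed_addr {
  ; Chain of dynamic-index inserts; only the indices are runtime values.
  %v0 = insertelement <4 x i32> %vector, i32 1, i32 %idx0
  %v1 = insertelement <4 x i32> %v0, i32 2, i32 %idx1
  ; Store the result so the insert chain is not dead code.
  store <4 x i32> %v1, ptr addrspace(3) %shared.mem, align 16
  ret void
}
```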

https://github.com/llvm/llvm-project/pull/162368