SchrodingerZhu wrote: I suppose the main cost just comes from spilling the values to the stack and then loading them back. Inlining them should speed things up a lot, but would make the solution less portable. https://github.com/llvm/llvm-project/pull/101110