efriedma-quic wrote:

Is there some reason to prefer that sequence over a shorter sequence, like a pair of ld1r followed by a zip1? I mean, I can imagine your sequence is faster on certain CPUs, but I'd want to document the reasoning.

https://github.com/llvm/llvm-project/pull/78632
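
For readers unfamiliar with the two instructions named above, here is a rough scalar model of what an ld1r pair followed by a zip1 computes. It is only a sketch: the four 32-bit lanes, the names `model_ld1r`, `model_zip1`, and `splat_pair`, and the alternating `{x, y, x, y}` result are assumptions made for illustration, not taken from the patch under review.

```cpp
#include <array>
#include <cstdint>

// Assumed vector shape for illustration: four 32-bit lanes.
using V4 = std::array<std::uint32_t, 4>;

// Scalar model of LD1R: load one element and replicate it into every lane.
V4 model_ld1r(const std::uint32_t *p) { return {*p, *p, *p, *p}; }

// Scalar model of ZIP1: interleave the low halves of the two inputs.
V4 model_zip1(const V4 &a, const V4 &b) { return {a[0], b[0], a[1], b[1]}; }

// Hypothetical helper: two replicating loads plus one interleave
// produce the alternating pattern {x, y, x, y}.
V4 splat_pair(const std::uint32_t *x, const std::uint32_t *y) {
  return model_zip1(model_ld1r(x), model_ld1r(y));
}
```

The model says nothing about latency or throughput, so whether this three-instruction form beats a longer sequence still depends on the target CPU.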