[llvm] [AArch64] Use `ZIP1/2` over `INS` for vector concat (PR #142427)

Tue Jun 3 02:08:52 PDT 2025

Il-Capitano wrote:

> INS is actually faster than a Q register ZIP on some cores. For example, Cortex-A55. Not sure how much we care about this.

Hmm, you're right. I didn't look at the Cortex-A55 optimization guide; I only checked Cortex-A510 and bigger cores, where INS and ZIP have the same latency and throughput. Thank you for noticing this!

If using ZIP also eliminates a MOV, as in the example below, then it's a definite win on Cortex-A55 too, but I'm not sure how common that would be.
```asm
mov v0.16b, v1.16b
mov v0.d[1], v2.d[0]
// Can be replaced by
zip1 v0.2d, v1.2d, v2.2d
```
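
To spell out why the two sequences above are interchangeable, here is a minimal sketch in portable C that models a Q register as two 64-bit lanes. The `v128` type and the function names are made up for illustration and are not part of LLVM or the patch. It only models lane semantics and says nothing about latency or throughput on any core.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit vector register viewed as .2d lanes. */
typedef struct { uint64_t d[2]; } v128;

/* Models: mov v0.16b, v1.16b ; mov v0.d[1], v2.d[0] */
static v128 concat_mov_ins(v128 v1, v128 v2) {
    v128 v0 = v1;       /* full-register copy */
    v0.d[1] = v2.d[0];  /* INS: overwrite the high lane only */
    return v0;
}

/* Models: zip1 v0.2d, v1.2d, v2.2d
 * With 64-bit elements, ZIP1 takes lane 0 of each source. */
static v128 concat_zip1(v128 v1, v128 v2) {
    v128 v0;
    v0.d[0] = v1.d[0];
    v0.d[1] = v2.d[0];
    return v0;
}

/* Returns 1 when both lanes match. */
static int same_lanes(v128 a, v128 b) {
    return a.d[0] == b.d[0] && a.d[1] == b.d[1];
}

/* Asserts that both sequences produce the same result for the given inputs. */
static void check_concat_equivalence(v128 v1, v128 v2) {
    assert(same_lanes(concat_mov_ins(v1, v2), concat_zip1(v1, v2)));
}
```

In both cases the high lanes of `v1` and `v2` are ignored, which is why a single ZIP1 can stand in for the MOV + INS pair when the destination differs from the first source.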

https://github.com/llvm/llvm-project/pull/142427