[llvm] [AArch64] Add MATCH loops to LoopIdiomVectorizePass (PR #101976)

Ricardo Jesus via llvm-commits llvm-commits at lists.llvm.org
Thu Nov 21 08:21:44 PST 2024


rj-jesus wrote:

It is. In the table below I've summarised the speedups obtained on a Neoverse V2 for combinations of search (columns) and needle (rows) vector sizes for the loop in the PR's description:
![data](https://github.com/user-attachments/assets/5aca75da-3453-4790-80f9-a463ad946f5d)
For item (4, 80) in the matrix, for example, we have a speedup of 16.1x. This means that, for search arrays of 80 characters and needle arrays of 4 characters, where the match happens at the last element of both arrays, we get an average speedup of 16.1x over the current scalar loops. I can probably share the benchmark I used to collect these numbers if you think that would be useful; I would just need to double-check this internally.

Overall the vectorised loops are at least competitive with the scalar loops, and often obtain substantial speedups over them. The exceptions are the degenerate cases where the search or needle array consists of a single element (top-left corner of the matrix).
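For reference, the loops being benchmarked are find-first-of style searches over byte arrays. A minimal scalar sketch of that shape (the names and signature here are illustrative, not the exact code from the PR description):

```c
#include <stddef.h>

/* Return the index of the first element of search[] that also occurs in
 * needle[], or search_len if no element matches. This is the scalar loop
 * shape that the pass vectorises with SVE2 MATCH. */
size_t first_match(const char *search, size_t search_len,
                   const char *needle, size_t needle_len) {
  for (size_t i = 0; i < search_len; ++i)
    for (size_t j = 0; j < needle_len; ++j)
      if (search[i] == needle[j])
        return i;
  return search_len;
}
```

In the benchmark described above, the match is placed at the last element of both arrays, so the scalar version runs the full `search_len * needle_len` comparisons and the speedup reflects the worst case for the scalar code.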

The output we generate for the vectorised loops also contains four unnecessary instructions, in particular:
* `mov	x8, x2` - redundant copy of `mov x9, x2`
* `add	x8, x8, #16` - as above, copy of `add x9, x9, #16`
* `mov	z1.q, q1` - we do this for correctness to broadcast the needles from a segment to a full SVE vector, but this is not necessary here since we effectively only work with the first 128-bit segment of the SVE register
* `ptrue	p0.b` - this is introduced by `experimental.cttz.elts` but is technically not needed in this case since we could reuse the previously defined `p0`

None of these issues is related to the transformation itself (and they don't affect its performance much anyway), so I think they can be addressed separately.

https://github.com/llvm/llvm-project/pull/101976


More information about the llvm-commits mailing list