[llvm] [AArch64][SVE] Fold ADD+CNTB to INCB and DECB (PR #118280)

Mon Dec 9 03:05:13 PST 2024

rj-jesus wrote:

Hi @david-arm and @sjoerdmeijer, many thanks for the feedback. I'm very sorry I haven't addressed it yet; last week was a bit busy!

I think in most cases where these patterns appear (e.g. `i+svcntb()`), they'll be incrementing a loop's induction variable/pointer by multiples of the VL, so having separate source/destination registers is probably uncommon. However, I do agree that sequences of:
```
mov x9, x8
incb x9
```
look ugly and should be avoided (even though the particular test causing this is a bit artificial and doesn't seem to be concerned with the particular pattern of ADDVL or MOV+INCB used).
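
For readers less familiar with these instructions, here is a rough Python model of the arithmetic involved (a sketch only, not LLVM code). It assumes the architectural behaviour that `INCB Xdn, ALL, MUL #imm` adds `imm` times the vector length in bytes to a single source/destination register, while `ADDVL Xd, Xn, #imm` does the same with separate registers. The function names and the `vl_bytes` parameter are made up for illustration.

```python
MASK64 = (1 << 64) - 1  # X registers are 64 bits wide


def incb(xdn: int, vl_bytes: int, mul: int = 1) -> int:
    """Model of `INCB Xdn, ALL, MUL #mul` (destructive: result replaces Xdn)."""
    if not 1 <= mul <= 16:
        raise ValueError("INCB multiplier must be in 1..16")
    return (xdn + mul * vl_bytes) & MASK64


def addvl(xn: int, vl_bytes: int, imm: int) -> int:
    """Model of `ADDVL Xd, Xn, #imm` (non-destructive: Xn is left untouched)."""
    if not -32 <= imm <= 31:
        raise ValueError("ADDVL immediate must be in -32..31")
    return (xn + imm * vl_bytes) & MASK64


def mov_incb(xn: int, vl_bytes: int, mul: int = 1) -> int:
    """Model of `MOV Xd, Xn; INCB Xd, ALL, MUL #mul`."""
    xd = xn  # the extra MOV is only needed when Xd and Xn differ
    return incb(xd, vl_bytes, mul)


if __name__ == "__main__":
    vl = 32  # hypothetical 256-bit vector length, in bytes
    x8 = 0x1000
    print(hex(mov_incb(x8, vl)), hex(addvl(x8, vl, 1)))
```

Both sequences compute the same value; the only difference is whether the source register has to be preserved, which is what forces the extra MOV.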

@david-arm do you have any particular codes you're concerned about?

>>    Hi @sjoerdmeijer, but it's really not obvious to me that incb gives better or same performance for immediate values that aren't 1, 2 or 4. In fact, I wouldn't be surprised if it gave worse performance in some circumstances which is why I wonder if we should be more cautious here?

> @rj-jesus : can you micro-benchmark this?

Sure, here's the latency of using the corresponding sequences (normalised to the latency of a simple ADD):
```
ADD: 1
INCB #1: 1
INCB #16: 2
MOV+INCB #1: 1.6
MOV+INCB #16: 2.6
ADDVL #1: 2
ADDVL #16: 2
```
The Neoverse V2 SWOG is a bit vague about the conditions under which MOV Xd, Xn "may not be executed with zero latency", which the micro-benchmark seems to hit (hence the extra 0.6 of an ADD latency for the MOV patterns, i.e. 60% over plain INCB #1). Nevertheless, even in these cases the MOV patterns with a fast INCB (e.g. MOV+INCB #1 at 1.6 vs. ADDVL #1 at 2) still seem to be no worse than ADDVL in terms of latency.
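
In case it helps when reading the numbers, the comparison can be spelled out with a few lines of Python. This is just arithmetic on the normalised latencies listed above; the dictionary and helper names are illustrative and not part of the benchmark itself.

```python
# Normalised latencies copied from the table above (ADD = 1).
LATENCY = {
    "ADD": 1.0,
    "INCB #1": 1.0,
    "INCB #16": 2.0,
    "MOV+INCB #1": 1.6,
    "MOV+INCB #16": 2.6,
    "ADDVL #1": 2.0,
    "ADDVL #16": 2.0,
}


def mov_overhead(imm: int) -> float:
    """Extra latency attributable to the MOV, in units of one ADD."""
    return round(LATENCY[f"MOV+INCB #{imm}"] - LATENCY[f"INCB #{imm}"], 2)


def delta_vs_addvl(imm: int) -> float:
    """MOV+INCB latency minus ADDVL latency (negative means MOV+INCB is faster)."""
    return round(LATENCY[f"MOV+INCB #{imm}"] - LATENCY[f"ADDVL #{imm}"], 2)


if __name__ == "__main__":
    for imm in (1, 16):
        print(f"#{imm}: MOV overhead {mov_overhead(imm):+.1f}, "
              f"vs ADDVL {delta_vs_addvl(imm):+.1f}")
```

With these figures the MOV costs about 0.6 of an ADD in both cases; MOV+INCB #1 ends up 0.4 below ADDVL #1, whereas MOV+INCB #16 ends up 0.6 above ADDVL #16.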

https://github.com/llvm/llvm-project/pull/118280