[PATCH] D156112: [AArch64][LoopVectorize] Improve tail-folding heuristic on neoverse-v1

Igor Kirillov via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jul 25 04:20:20 PDT 2023


igor.kirillov added inline comments.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:3789
+  // insufficient computation between comparisons can slow down the code.
+  return NumInsns >= SVETailFoldInsnThreshold * NumComparisons;
 }
----------------
david-arm wrote:
> At first glance this feels a little brutal - doubling (or even tripling!) the threshold when there is an extra compare in the loop. Also, I would have thought that after adding one or two more compares we should really hit a plateau for the threshold, because at that point the volume of compares is more likely to be the bottleneck after filling up all the pipelines?
> 
> @igor.kirillov What loops have you tested this on and have you established what the minimum thresholds required to prevent tail-folding are? I just wonder if instead of multiplying the threshold you can actually do something like this:
> 
>   unsigned AdditionalInsns = NumComparisons > 1 ? 5 : 0;
>   return NumInsns >= (SVETailFoldInsnThreshold + AdditionalInsns);
> 
> If you haven't done so already, then I think it's worth collecting some data on what thresholds are really needed.
Brutal but effective :)
It's hard to make a simple yet precise heuristic. I tried to see how many computational instructions we need after a comparison to make this problem disappear, and it is around ten fmul/fadd instructions. If there are memory access instructions among them, we need fewer.

If we have a loop like this:
```
for (Index_type j = 0; j < N; ++j) {
  pout[j] = pin1[j] < pin2[j] ? pin1[j] : pin2[j];
}
```
Then if N is big and this loop is executed only a few times, the throughput problem is completely overshadowed by slow memory accesses. If N is small and the loop is executed around N times, the throughput problem comes to the fore, and this loop can be up to 2 times slower.

The heuristic I've just added ignores memory operations and the number of instructions BETWEEN comparisons, but I doubt accounting for them is worth it, at least for now.
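
For reference, a very rough sketch of the shape such a refinement could take (`NumMemOps` and `MemOpWeight` are invented here purely for illustration and are not part of this patch):
```
static bool hasEnoughWorkPerComparison(unsigned NumInsns,
                                       unsigned NumComparisons,
                                       unsigned NumMemOps,
                                       unsigned Threshold) {
  // A memory access already hides some of the comparison latency, so count
  // it as if it were a couple of extra computational instructions.
  const unsigned MemOpWeight = 2;
  return NumInsns + NumMemOps * MemOpWeight >= Threshold * NumComparisons;
}
```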

To keep the disruption to benchmarks as small as possible, I would require 5 extra instructions for each extra comparison before allowing predicated tail-folding. (One regressed benchmark has 16 instructions and one extra comparison; the other has 24 instructions and 2 extra comparisons.) The open question is how this code should behave when a user passes a different `sve-tail-folding-insn-threshold` value.
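
If we go the additive route, here is a rough (untested) sketch of what it could look like, using the 5-instructions-per-extra-comparison figure from above; how it should interact with a user-supplied `sve-tail-folding-insn-threshold` is exactly that open question:
```
  // Require the base threshold plus 5 extra instructions for every
  // comparison beyond the first, instead of multiplying the threshold by
  // the number of comparisons.
  unsigned AdditionalInsns = NumComparisons > 1 ? (NumComparisons - 1) * 5 : 0;
  return NumInsns >= SVETailFoldInsnThreshold + AdditionalInsns;
```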


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D156112/new/

https://reviews.llvm.org/D156112


