[PATCH] D130618: [AArch64][LoopVectorize] Enable tail-folding by default for SVE

Sjoerd Meijer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 21 07:36:09 PST 2022


SjoerdMeijer added a comment.
Herald added a subscriber: pcwang-thead.

> Hi @SjoerdMeijer, thanks for looking into this. I do actually already have a patch to enable this by default (https://reviews.llvm.org/D130618), where the default behaviour is tuned according to the CPU. I think this is what we want because the profile will change according to what CPU you're running on - some CPUs may handle reductions better than others.

I didn't know what the problem was with reductions. I now see you had a discussion with Dave here about reductions, but it looks like it is still unclear why exactly that is a problem (at least to me).
But I thought that "simple" would be a good starting point as a default, which is more or less what you're doing here too. You're setting it for a few CPUs, but my feeling is that "simple" should be a very safe bet for all of them.
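To make concrete what I mean by "simple": a loop with no reductions or recurrences, where tail-folding only has to predicate the memory operations of the final partial iteration. A rough sketch below (and I'm quoting the option spelling from memory, so treat `-mllvm -sve-tail-folding=simple` as approximate):

  // A "simple" loop: no reductions or recurrences, so tail-folding only
  // needs to predicate the loads/stores of the last partial iteration.
  void saxpy(float *dst, const float *src, float a, int n) {
    for (int i = 0; i < n; ++i)
      dst[i] = a * src[i] + dst[i];
  }

  // A reduction loop (assuming -ffast-math so it can be vectorised at all):
  // when tail-folded, the reduction itself has to be predicated or the
  // inactive lanes merged out, which is presumably where the per-CPU
  // differences you mention come from.
  float sum(const float *src, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
      s += src[i];
    return s;
  }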

> The decision in this patch may be incorrect for 128-bit vector implementations. I also ran SPEC2k17 on an SVE-enabled CPU and I remember I saw a small (2-3%) regression in parest or something like that, which is one of the reasons I didn't push the patch any further. I also think it's really important to run a much larger set of benchmarks besides SPEC2k17 and collect numbers to show the benefits, since there isn't much vectorisation actually going on in SPEC2k17.

I will collect data for SPEC FP. I don't know exactly how much vectorisation is going on there, but we will see. By the way, most vectorisation happens in x264, which sees a significant uplift. My numbers on a 256-bit vector implementation show neutral results for the other SPEC INT apps. I think this is a very good start.

> One of the major problems with the current tail-folding implementation is that we make the decision before doing any cost analysis in the vectoriser, which isn't great because we may be forcing the vectoriser to take different code paths than if we didn't tail-fold. Ideally what we really want is to move to a model where the vectoriser has a two-dimensional matrix of costs considering the combination of VF and vectorisation style (e.g. tail-folding vs whole vector loops, etc.), and choose the optimal combination.

Yeah, I am aware of this and noticed it while implementing tail-folding for MVE. It is a problem, but something we will have to live with for a while, I think, as changing it is not exactly low-hanging fruit. That's my impression at least. However, with "simple" tail-folding it is difficult to see how it would lead to regressions.
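For what it's worth, this is how I picture the model you describe, purely as a sketch: none of these names are real vectoriser APIs, the point is only that the cost comparison becomes a search over both the VF and the vectorisation style:

  #include <algorithm>
  #include <vector>

  // Hypothetical sketch only: pick the cheapest (VF, style) pair instead of
  // fixing the vectorisation style before any costing has been done.
  enum class VecStyle { WholeVectorLoop, TailFolded };

  struct Candidate {
    unsigned VF;     // vectorisation factor
    VecStyle Style;  // whole-vector loop + epilogue vs. tail-folded body
    float Cost;      // normalised cost from the usual cost model
  };

  Candidate pickBest(const std::vector<Candidate> &Matrix) {
    // Choose the minimum-cost entry across both dimensions.
    return *std::min_element(Matrix.begin(), Matrix.end(),
                             [](const Candidate &A, const Candidate &B) {
                               return A.Cost < B.Cost;
                             });
  }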

What are your ideas to progress this?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130618/new/

https://reviews.llvm.org/D130618


