[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop

Tue Jul 24 05:58:59 PDT 2018

On Tue, 24 Jul 2018 at 13:46, Hal Finkel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Ah, interesting. I was indeed wondering whether this was another case where we'd benefit from having vectorizer-maximize-bandwidth on by default. I recall that our defaults have long been to prefer smaller VF in cases where the costs are equal to avoid extra legalization costs (and, potentially, to retain more freedom for interleaving later). Should the costs here be equal, or should we have some extra target information to distinguish them? It sounds like they're not really equal in practice.

Hi Hal,

Enabling vectorizer-maximize-bandwidth by default has been explored
before (see history in https://reviews.llvm.org/D46283) and it created
a handful of problems.

With the correctness problems fixed, we were still blocked by
significant performance regressions, which probably had to do with the
cost model. Adhemerval sent this patch instead:
https://reviews.llvm.org/D48332, which dealt with the specific case at
hand.

I remember hearing that some complex loops get worse performance with
wide loads (e.g. AVX512) because they make the loop too short and the
shuffles in and out too long, or increase the number of shuffles if
the loads are not trivial.
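
For illustration only, here is a minimal sketch of the kind of loop
under discussion. It is not taken from either review; the function
name and the element types are assumptions. As far as I understand the
heuristic, with 128-bit registers the default sizes the VF from the
widest type in the loop (the 32-bit accumulator, so VF=4), while
-vectorizer-maximize-bandwidth also considers the narrowest type (the
16-bit inputs, so VF up to 8), at the cost of extra extends and
shuffles to legalize the wider accumulator.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dot-product reduction: 16-bit inputs, 32-bit accumulator.
 * The mixed element widths are what make the VF choice interesting. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Widen each 16-bit element to 32 bits before multiplying. */
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

The two behaviours can be compared by building this with and without
`-mllvm -vectorizer-maximize-bandwidth` and looking at the generated
loop body.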

So, while enabling it by default is probably a good idea in a lot of
cases, we probably need to be careful about making it a broad,
across-the-board default. I don't have more info on the individual
cases, though, so this is more of an FYI. :)

-- 
cheers,
--renato