[llvm-dev] LoopVectorizer: shufflevectors

Saito, Hideki via llvm-dev llvm-dev at lists.llvm.org
Tue Sep 4 18:58:38 PDT 2018


>> To me, this looks like something the LoopVectorizer is neglecting and
>> should be combining.
>
>It's not up to the vectoriser to combine code.
>
>But it could be up to the vectoriser to generate less bloated code,
>given it's a small change.
>
>That's my point above.

We should note that
1) The Loop Vectorizer is not the only place that generates vectorized IR. For example, a programmer's intrinsic vector code, after inlining etc., might show the same problem. Any optimization added within LV won't be applied when other parts of the compiler generate vectorized IR.
2) The vectorizer's main job is to generate widened vector code that is easier to optimize later on, not necessarily to generate highly optimized vector code on its own.
3) Modeling the cost correctly (and, as a result, choosing a good VF) is a more important problem than performing the optimization within the vectorizer itself.
4) If the cost model takes the optimization into account, LV has a chance of generating optimized code. That doesn't necessarily mean LV should be the one doing it ---- back to 1).

The last thing we want is to turn LV into a gigantic monolithic optimizer that is hard to maintain.

I think we should talk about how much complexity we would be adding for a general "vectorized load/store optimization", and whether we should instead have a separate post-vectorizer optimizer doing it (while LV would still need to understand the cost-modeling aspect of that optimization in order to choose the right VF). This should include a discussion about moving the interleaved memory access optimization from LV to there. Adding a small new optimization here and there to LV can have a snowball effect.

Thanks,
Hideki

==============================
Date: Tue, 4 Sep 2018 18:57:17 +0100
From: Renato Golin via llvm-dev <llvm-dev at lists.llvm.org>
To: 
Cc: LLVM Dev <llvm-dev at lists.llvm.org>, Ulrich Weigand
	<ulrich.weigand at de.ibm.com>
Subject: Re: [llvm-dev] LoopVectorizer: shufflevectors
Message-ID:
	<CAMSE1kcHuN4a-a1VTUdsyyVD_9aThZ6p_N8ZbPhW1H8KoxAJtg at mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Tue, 4 Sep 2018 at 17:35, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
> > It's probably a lot simpler to improve the SystemZ model to have
> > the same arch flags / cost model completeness as the other
> > targets.
> I thought they were - anything particular in mind?

I have no idea about SystemZ, sorry. :)

From your post and response, it seems that improving both the target
info and the cost model opens up new ways to vectorise on SystemZ.

That's what I was referring to.


> This then made many more cases of interleaving happen (~450 cases on
> spec IIRC). The only problem was... the SystemZ backend could not handle
> those shuffles equally well in all cases. To me that looked like
> something to be fixed at the IR level, and after discussions with
> Sanjay I got the impression that this was the case...

Right. Being fixed at the IR level and that being done in the
vectoriser are two different things.

Our current implementation is too monolithic to be trying out
branches off the beaten path, and we're in the process of moving to
VPlan (which can still take years), so I don't recommend big
refactorings of the code.

You could probably find a number of simplifications, taking target
info into consideration, that could later be ported to VPlan, but that
will require testing the vectorisation on the supported targets.

We don't need to re-benchmark everything, just make sure the
code doesn't change for them, or if it does, to know why.


> To me, this looks like something the LoopVectorizer is neglecting and
> should be combining.

It's not up to the vectoriser to combine code.

But it could be up to the vectoriser to generate less bloated code,
given it's a small change.

That's my point above.


> I suppose with my patch for the Load -> Store
> groups, I could also add the handling of recomputed indices so that the
> load group produces a vector that fits the store group directly. But if
> I understand you correctly, even this is not so wise?

It will depend on how much that changes other targets, because what
looks less bloated can also mean patterns are not recognised any more
by other back-ends.


> And if so, then indeed improving the SystemZ DAGCombiner is the only alternative left, I guess...

You'll probably have to do that anyway, but I wouldn't try it unless I
had no other choice. :)


> But having the cost functions available is not enough to drive a later
> IR pass to optimize the generated vector code? I mean, if the target
> indicated which shuffles were expensive, those could then easily be avoided.

Sure, but "expensive" is a relative term and it's intimately linked to
what the back-end can combine.

If you're lucky enough that a mid-end change just happens to unbloat
shuffles and be correctly lowered, without breaking other targets,
then that's a big win.

-- 
cheers,
--renato


