[llvm-dev] [RFC] Goals for VPlan-based cost modelling

David Green via llvm-dev llvm-dev at lists.llvm.org
Wed Nov 4 07:20:10 PST 2020


Hello

The grammar idea sounds interesting. My brain tends to just work in terms of code, but it might make a great master's thesis :)  The pattern matchers were modelled after the existing ones in LLVM that are used all over the place, especially in instcombine. The aim was for it to feel familiar alongside what already exists.

I imagine a lot of people must have thought about the same kinds of ideas, and there must be some research out there on it. Whether it worked well or not, it would be good to look into what has already been tried. I imagine there will be cases where it works well and cases where it doesn't. The same concept of specifying patterns in some way that isn't C++ code could apply to instcombine or DAG combining too. The fact that no-one has ever done it is perhaps a bad sign? I know there are some TableGen-defined combines for GlobalISel now; they might give some ideas.

Combining a few instructions into a vmulh is simple enough, and I was hoping that there would not be a huge number of combines needed. There are bigger transforms that perhaps wouldn't fit that very well. Imagine trying to take code that does:
x, y = vld2
x *= -1
vst2 x, y

and turn that into
x = load
x *= [-1,1]
str x

That is like "de-interleaving" an interleaving group, much like SLP vectorization. The transform would also need to be generic across all the different operations that could sit between the loads/store, and across the size of the interleaving groups. It could, of course, get quite complex.
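To make the equivalence concrete, here is a scalar C++ sketch (the function names are mine for illustration, not anything in LLVM) showing that negating the x lanes of a vld2/vst2 interleaving group computes the same result as one flat load multiplied by a repeating [-1, 1] pattern:

```cpp
#include <array>

// Model of the interleaved form: x lanes are the even elements,
// y lanes the odd ones; only x is negated, y passes through.
std::array<int, 8> interleaved(std::array<int, 8> v) {
  for (int i = 0; i < 8; i += 2)
    v[i] *= -1; // x *= -1
  return v;
}

// Model of the de-interleaved form: one flat load/store with a
// multiply by the repeating vector [-1, 1].
std::array<int, 8> deinterleaved(std::array<int, 8> v) {
  const int mask[2] = {-1, 1};
  for (int i = 0; i < 8; ++i)
    v[i] *= mask[i % 2];
  return v;
}
```

The two functions agree on every input, which is exactly what makes the transform legal; the hard part the email describes is proving that for arbitrary operations and group sizes.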

Looking forward to seeing what you come up with.
Dave

--------------------------------------------
Sent: 03 November 2020 15:45
Subject: Re: [llvm-dev] [RFC] Goals for VPlan-based cost modelling 
 
Hi Dave,

Thanks for your detailed reply. The reductions you describe really seem 
like a perfect example of what a new cost model and IR transformation 
should be able to cope with, as do vmulh and the other instructions you 
mentioned.
In particular, they underline the importance of finding a solution for 
patterns of arbitrary size, which is much harder than finding a 
feasible solution for patterns of size two like the ones I mentioned.

A thought that has been sitting in my head for a while is whether this 
part of transforming IR (or VP-) instructions into VPRecipes could be 
handled by some kind of grammar, like the grammars used for parsing in 
NLP, or in compilers. Patterns could be represented as grammar rules, 
and if more than one pattern matches, the analysis can fork and continue 
on two or more VPlans in parallel. Recipes that can be used on many 
architectures would live in the main grammar. It could be a 
probabilistic grammar with probabilities of either 0 or 1, such that 
architectures that do not implement a recipe can set the probability of 
it ever being generated to 0. For very target-specific patterns, 
target-specific sub-grammars could be maintained.
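As a minimal sketch of that 0/1 gating idea (all names invented for 
illustration, not a proposed interface): each rule rewrites a sequence 
of abstract opcodes into one recipe, and a target disables a rule by 
setting its probability to 0.

```cpp
#include <string>
#include <vector>

// One grammar rule: a pattern of opcodes that produces a recipe,
// gated by a per-target probability (1 = enabled, 0 = disabled).
struct GrammarRule {
  std::vector<std::string> Pattern;
  std::string Recipe;
  double Probability;
};

// A rule fires only if the target has not zeroed it out and the
// opcode sequence matches exactly.
bool ruleApplies(const GrammarRule &R,
                 const std::vector<std::string> &Ops) {
  return R.Probability > 0.0 && R.Pattern == Ops;
}
```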
On a totally crazy thought, a cost model could be integrated into this 
grammar step, since the patterns need legality (type) checking anyway, 
as is currently done in some TTI cost functions. The recipes could then 
be annotated with their cost at generation time; we maintain a total 
cost for each VPlan and can define a threshold on `cost(VPlan) - 
minVPlanCost` above which we abandon plans on the fly, discarding 
options that are unlikely to be fruitful mid-analysis (I know that 
threshold would be difficult to find, but at least it would reduce the 
problem space a bit).
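The pruning rule itself is simple enough to sketch in C++ (again, all 
names are illustrative, not VPlan API): keep a running minimum over the 
candidate plans and drop any plan whose accumulated cost exceeds it by 
more than the threshold.

```cpp
#include <algorithm>
#include <vector>

// A candidate VPlan reduced to the one thing the pruning rule needs:
// the cost accumulated at recipe-generation time.
struct CandidatePlan {
  int Cost;
};

// Abandon plans whose cost is more than Threshold above the cheapest
// plan seen so far; they are unlikely to catch up.
void pruneCandidates(std::vector<CandidatePlan> &Plans, int Threshold) {
  if (Plans.empty())
    return;
  int MinCost =
      std::min_element(Plans.begin(), Plans.end(),
                       [](const CandidatePlan &A, const CandidatePlan &B) {
                         return A.Cost < B.Cost;
                       })
          ->Cost;
  Plans.erase(std::remove_if(Plans.begin(), Plans.end(),
                             [&](const CandidatePlan &P) {
                               return P.Cost - MinCost > Threshold;
                             }),
              Plans.end());
}
```

The difficult part, as noted above, is choosing Threshold so that the 
eventual winner is never pruned mid-analysis.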

In an approach like that, patterns could be arbitrarily large, but I do 
have some fears that it might lead to over-generating VPlans. It would 
also probably mean that a lot of recipes are needed to model every 
instruction that can possibly be generated, but if the concrete costs 
were calculated at parse time and the information from the matched 
instructions were reduced to what is needed to generate the output IR, 
it might be possible to keep that manageable.

I think your example in D88152 goes in a similar direction, only it 
transforms VPInstructions into another VPInstruction, not into a recipe 
(the pattern matching does remind me a bit of TableGen, tbh).
What I wonder is: if leveraging VPlan means we need to do some kind of 
pattern matching anyway, be it in the cost model for correct cost 
calculation, or to detect and transform patterns of VPInstructions, 
wouldn't it conceptually make sense to explicitly define something like 
a grammar, instead of doing it per instruction/pattern and per 
application (costing/transformation)?

What are your thoughts on this?

Thanks,
Anna

On 03.11.20 11:55, David Green wrote:
> Hello
>
> Perhaps I can talk a little about the things I have at least been thinking about lately. Unfortunately a few other important things have come up and I have not had the time to do as much as I would have liked. What I have is mostly on the pragmatic end of the spectrum.
>
> So we've been thinking about this for a while, but it really starts with vector reductions. Arm MVE has some instructions that make it beneficial to put vector reduction instructions into the loop, instead of summing in the loop and reducing after it. So instead of:
> loop:
>    a = load v16i8
>    s = s + a
> reduce(s)
>
> We can efficiently do:
> loop:
>    a = load v16i8
>    s += reduce a
>
> That is all working now, and seems to be humming along nicely in the vectorizer. It even handles tail-predicated loops. The real benefit from those instructions, though, comes from the fact that they can natively handle larger-than-legal reductions. So we have an instruction that can take 16 i8s and sum them into an i32. That would look something like this in IR:
> loop:
>    a = load v16i8
>    e = sext a to v16i32
>    s += reduce e
>
> It can also do an mla at the same time, doing reduce(mul(sext, sext)) as a single instruction. We even have i64 variants of the instruction, even though there is no i64 vector add in MVE. Unfortunately, all of this is both too large for the vectorizer to consider at the moment and looks like a set of very expensive operations. Extends especially are very expensive at the moment in MVE.
>
> We can already cost model load(sext(..)) well (along with other patterns like add(sext, ..) for aarch64). We have the context parameter (plus a parameter for the type of the load now). For 2 instruction patterns this mostly works OK, if not very elegantly, providing that the context instruction can be trusted (which with vplan making more changes becomes a big if). Any more than 2 starts to get very awkward though, trying to check that you are the middle mul in a reduce(mul(sext, sext)) pattern. Especially if the reduce looks like an add!
>
> I had a go at implementing some code to try and get the reductions more correct. One idea was a better way of providing context. Something like an "InstructionView", in the same way as there is an ArrayView or StringView. It gives a way to query possibly fake instructions for the info the backend might need to make decisions, and could be backed by real instructions or VPlan nodes. Another, simpler idea was to keep a list of "free instructions" that the backend could provide, and to iterate backwards through the costs instead of forwards, letting the backend greedily choose patterns. I didn't much like the changes they required though, and didn't get very far with them, as they are all very large changes that do not play well with the cost model as it currently exists.
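> The second idea, iterating backwards with a set of "free" instructions, could look something like this sketch (all names are invented for illustration, not the real cost-model interface): walking the block in reverse means a pattern root is seen before its operands, so the backend can mark absorbed operands as free before they are costed.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy instruction: a name, a standalone cost, and operand links.
struct Instr {
  std::string Name;
  int Cost;
  std::vector<const Instr *> Operands;
};

// Cost a block backwards; instructions already absorbed into a pattern
// are skipped. As a stand-in for a real backend hook, every costed
// instruction greedily absorbs its direct operands.
int costBlock(const std::vector<Instr> &Block,
              std::set<const Instr *> &Free) {
  int Total = 0;
  for (auto It = Block.rbegin(); It != Block.rend(); ++It) {
    if (Free.count(&*It))
      continue; // absorbed by a pattern rooted later in the block
    Total += It->Cost;
    for (const Instr *Op : It->Operands)
      Free.insert(Op);
  }
  return Total;
}
```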
>
> So I knew that there were some other things that I wanted to start making better in the vectorizer/costmodel, maybe investigating those would give a better view onto what the best way to handle this would be. One is instructions like vmulh. They multiply and take the top half, and are common across a number of architectures. Things like vrmulh, vhadd, vrhadd, vqdmulh, vabsd, etc can all fit into the same bucket. They conceptually do something like:
> loop:
>    a = load v16i8
>    b = load v16i8
>    ae = sext a to v16i16
>    be = sext b to v16i16
>    m = mul ae, be
>    s = shr m, 8
>    t = trunc s to v16i8
>    store t
>
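> In scalar C++ terms the whole pattern collapses to this (an illustration of mine, not an intrinsic): widen, multiply, shift the top half back down, truncate.

```cpp
#include <cstdint>

// Scalar model of the vmulh pattern above: for i8 inputs, keep only
// bits [15:8] of the widened 16-bit product.
int8_t mulHigh(int8_t A, int8_t B) {
  int16_t Wide = int16_t(A) * int16_t(B); // sext + mul in i16
  return int8_t(Wide >> 8);               // shr 8, trunc back to i8
}
```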
> So the same issue with sign extends being expensive exists, plus the cost of those instructions will be considered very wrong (especially if you consider an i32 extended to an i64 on an architecture with no 64-bit operations).  That I put up as https://reviews.llvm.org/D88152, mostly as a proof of concept to show how it can be done. It recognizes the pattern in VPlan and turns it into a single mulh VPInstruction simply enough. The obvious problem is that we are still costing the original instructions in the loop though. Hence the cost model really needs to be moved over to costing VPlan nodes, not the original IR instructions (which leads us to https://reviews.llvm.org/D89322 making a start at that).
>
> D88152 obviously still needs some details filled in. The rough idea was to have a set of "VectorPatterns" that the backend could be queried for the cost of (like vmulh etc). In this case we are replacing 5 instructions with 1 at a lower bitwidth, so the change is going to be fairly obviously worth making if it is legal, and can be done by replacing nodes. In other cases it may be better to clone the VPlan and cost the two against one another. It still seems simpler in these cases (and the concepts are general enough) that combining in the vectorizer seems like a good way forward. It doesn't really help the cost model anywhere else in the compiler though, if that is something that becomes important.
>
> We would love to be able to start doing things like costing predicated VPlans against unpredicated VPlans, along with things like de-interleaving the interleaving groups (kind of like SLP vectorization), costing gathers vs interleaving loads, and possibly even getting the vectorizer to consider lane interleaving if we can. All of this requires us to be costing VPlans though, at least as a first step. Some of these, like (what I have been calling) lane interleaving, certainly require more global analysis than the local checks we can do in places like instcombine or ISel, to make sure they are really profitable. It's hard to guess at their costs.
>
> I also found it very easy to introduce regressions when trying to rewrite an entire cost model, as you might imagine. Simple enough changes (such as actually costing predicated instructions) can have severe effects on some code. It was something I was trying to go slowly on, and hopefully do a bit at a time. We will see how that goes though.
>
> In short, I think I like the idea of VPlan looking more like what the target expects it to end up as, as opposed to trying to reverse engineer or guess at that in the cost model. None of this has been through review or anything like that; it was just the kinds of things we were trying to work on and hoping to improve. VPlan has a lot of potential, I believe, and it would be great to see it start solving some real problems.
>
> I'm not sure if that helped at all. Maybe it at least gave some context, even if it was a bit rambly.
>
> Looking forward to seeing what your thoughts are,
> Dave
