[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR

David Greene dag at cray.com
Tue Aug 5 08:32:02 PDT 2008


On Monday 04 August 2008 17:56, Dan Gohman wrote:

> > By "complicate" do you mean "need to look at multiple addresses from a
> > single instruction?"  Or is there more than that?  I'm trying to
> > understand
> > all the implications.
>
> I mean just that -- we have a fair amount of code built around looking
> at the addresses of load and store nodes that would in some cases need
> to be restructured to cope with multiple addresses at a time.

Ok.  I should think this would be feasible.  In the worst case it's an N^2
loop over all pairs of addresses, and N will usually be small.

> >>  %p = applymask <2 x f32*> %q, <2 x i1> %m
> >>  %x = load <2 x f32*>* %p                   ; implicitly masked by %m
> >>  %y = add <2 x f32> %x, %w                  ; implicitly masked by %m
> >>  %z = mul <2 x f32> %y, %y                  ; implicitly masked by %m
> >
> > Yuck.  I don't like this at all.  It makes reading the IR harder because
> > now you need to worry about context.
>
> I don't disagree with these. I think it's a trade-off, with LLVM
> design philosophy and IR cleanliness arguments on both sides.
>
> The applymask approach leverages use-def information rather than
> what can be thought of as duplicating a subset of it, making the IR

I don't understand what you mean by "duplicating" here.  You need some
kind of use-def information for the masks themselves because at some
point they need to be register-allocated.

> less cluttered. And, it makes it trivially straightforward to write
> passes that work correctly on both masked and unmasked code.

I had a thought on this, actually.  Let's say the mask is the very last
operand on masked instructions.  Most passes don't care about the mask at
all and can simply ignore it.  Since they don't look at the extra operand
today, there shouldn't be many changes necessary (some asserts may need
fixing, etc.).
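
For concreteness, here's roughly what that could look like.  The trailing-mask
syntax below is just a sketch I made up for illustration, not a proposal for
the final spelling:

  %x = load <2 x f32>* %p, <2 x i1> %m       ; hypothetical: mask as last operand
  %y = add <2 x f32> %x, %w, <2 x i1> %m
  %z = mul <2 x f32> %y, %y, <2 x i1> %m

A pass that only ever inspects the "real" operands never notices the extra one.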

Think about instcombine.  It's matching patterns.  If the matcher doesn't
look at masks, that may be OK most of the time (modulo corner cases, which
I fully appreciate can be a real pain to track down).  If we want fancy
instcombine tricks that understand masks, we can add those later.
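
As a hedged example, using the same made-up trailing-mask syntax from above,
an add-of-zero fold stays valid whether or not the matcher looks at the mask:

  %y = add <2 x i32> %x, zeroinitializer, <2 x i1> %m
  ; folding %y to %x is still correct: active elements are unchanged, and
  ; elements under a zero mask bit are undefined either way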

> >  Not all dependencies are readily expressed in the instructions.  How
> > would one express TableGen patterns for such things?
>
> The syntax above is an idea for LLVM IR. SelectionDAG doesn't necessarily
> have to use the same approach.

What do you mean by "an idea for LLVM IR?"  This looks very much _not_ ideal
to me from a debugging standpoint.  It's difficult to understand; it took me
a few readings of the proposal to grok what you are talking about.

> I think we all recognize the need, and in the absence of better
> alternatives are willing to accept the mask operand approach. It would
> have a significant impact on everyone, even those that don't use masks.

How do you define "significant impact?"  Compile time?  Development effort?
Transition pain?  All of the above?  More?

For architectures that don't use masks, either the mask gets set to all 1's or
we have non-masked versions of the operators.  I honestly don't know which is
the desirable route to take.  My guess is that the optimizers will have to
understand whether or not the target architecture supports masks and avoid
generating them (e.g. no if-conversion) if the target doesn't.
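
In the all-1's scheme, a target without mask support would only ever see the
trivial mask (again in the hypothetical trailing-operand syntax), which
codegen could simply drop:

  %y = add <2 x i32> %x, %w, <2 x i1> <i1 true, i1 true>   ; every lane active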

I wonder if there is some way to un-if-convert to eliminate masks if
necessary.  I'm thinking about code portability and JIT issues when reading
in LLVM IR that was produced at some earlier time.  Perhaps this isn't an
issue we need to worry about right now.

> I don't want to stand in the way of progress, but this alternative
> approach seems promising enough to be worth consideration.

Alternatives are always welcome and worth considering.  I'm looking at the
kinds of things the LLVM community is going to want to support, and I'm
pretty sure masks are going to be a very big part of architectures in the
future.  We're done with clock speed improvements, so we need to rely on
architecture more.  Vectorization is a well-known technique for improving
single-thread performance, and masks are critical to producing efficient
vector code.

If y'all agree with this premise, it seems to me that we want to support
such architectures in as straightforward a way as possible so as to minimize
future pain when we're all writing complex and beautiful vector hacks.  :)

What can we learn from the IA64 and ARM backends?  How do they handle
their masks (scalar predication)?  Is all the if-conversion done in 
target-specific passes?

> > We concluded that operation results would be undefined for vector
> > elements corresponding to a zero mask bit.
> >
> > We also talked about adding a vector select, which is crucial for any
> > code that uses masks.
>
> Right. This applymask idea doesn't conflict with these.

Yep.  I just wanted to be thorough.
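
For reference, the kind of vector select we discussed would presumably look
something like this (the exact form is still to be decided):

  %r = select <2 x i1> %m, <2 x f32> %a, <2 x f32> %b
  ; element i of %r comes from %a where %m is true, from %b otherwise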

                                                  -Dave


