[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR

Dan Gohman gohman at apple.com
Thu Aug 7 19:48:42 PDT 2008


On Aug 7, 2008, at 12:13 PM, David Greene wrote:

> On Tuesday 05 August 2008 13:27, David Greene wrote:
>
>> Neither solution eliminates the need for instcombine to be careful and
>> consult masks from time to time.
>>
>> Perhaps I'm totally missing something.  Concrete examples would be helpful.
>
> Ok, so I took my own advice and thought about CSE and instcombine a bit.
> I wrote the code by hand in a sort of pseudo-llvm language, so don't
> crucify me for mistakes.  :)
>
> CSE is an optimization that needs to pay attention to masks.  You can't
> (easily) CSE an expression with different masks.
>
> Using a mask-per-operation setup, code might look like this:
>
> CSE Mask Example
>
> Mask-per-operation:
> %m1 = ...
> %m2 = ...
> %r1 = add %r2, %r3, %m1
> %r4 = add %r2, %r3, %m2
>
> We can only CSE the adds if %m1 == %m2.  I'd think it would be
> sufficient to just check whether the mask registers are the same, though
> you could imagine walking back through the use-def chains to do more
> complex analysis.  Essentially, the mask operand becomes part of the
> expression you're trying to match.  CSE definitely needs to be
> changed to understand masks.
>
> With applymask, things look something like this:
>
> applymask:
> %m1 = ...
> %m2 = ...
> %r5 = applymask %r2, %m1
> %r6 = applymask %r3, %m1
> %r7 = applymask %r2, %m2
> %r8 = applymask %r3, %m2
> %r1 = add %r5, %r6
> %r4 = add %r7, %r8
>
> This is interesting because CSE will not do the optimization since the
> operands of the adds are different.  CSE does not have to change to
> understand masks.  Value numbering will have to understand applymask so
> that it doesn't do something bad, since it doesn't care about the names
> of operands.  Still, the renaming of registers required by applymask is
> somewhat attractive.

Right. One way to think about it is that applymask can conservatively be
treated as a plain binary operator, for register-oriented passes at least.
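To make this concrete, here's a rough Python model (the tuple-based IR
representation is invented, not LLVM's) of a local value-numbering/CSE
pass under both schemes. With mask operands, the mask is simply one more
operand in the hash key; with applymask, the applymask instruction is
hashed like any other operation and the pass needs no mask-specific code:

```python
# Toy value numbering over a tuple IR (all names invented).
# Each instruction is (dest, opcode, operand, ...).  Whatever appears
# as an operand -- including a mask -- ends up in the hash key.
def number_values(instructions):
    numbers, table = {}, {}
    for dest, opcode, *ops in instructions:
        key = (opcode, *(numbers.get(o, o) for o in ops))
        table.setdefault(key, len(table))
        numbers[dest] = table[key]
    return numbers

# Mask-per-operation: same add, different trailing mask operand,
# so the keys differ and the adds are not CSE'd.
per_op = number_values([
    ("r1", "add", "r2", "r3", "m1"),
    ("r4", "add", "r2", "r3", "m2"),
])
assert per_op["r1"] != per_op["r4"]

# applymask: the adds already have different operands, so plain
# value numbering keeps them apart with no changes at all.
apply = number_values([
    ("r5", "applymask", "r2", "m1"),
    ("r6", "applymask", "r3", "m1"),
    ("r7", "applymask", "r2", "m2"),
    ("r8", "applymask", "r3", "m2"),
    ("r1", "add", "r5", "r6"),
    ("r4", "add", "r7", "r8"),
])
assert apply["r1"] != apply["r4"]
```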

>
>
> Now let's look at an instcombine example:
>
> Instcombine example: -A + -B --> -(A + B)
>
> Mask-per-operation:
> %m1 = ...
> %m2 = ...
> %m3 = ...
> %r1 = neg %r2, %m1
> %r1 = select %r1, %r2, %m1  # Define all elements
> %r3 = neg %r4, %m2
> %r3 = select %r3, %r4, %m2  # Define all elements
> %r5 = add %r1, %r3, %m3
>
> Since the masks on the negs are different we have to be careful in how
> we do this.  In the easiest case, we simply disallow combining here,
> which requires instcombine to examine the masks and, as in the CSE
> example above, potentially just look at the mask register names to
> determine equality.
>
>
> One could imagine doing the combining this way:
>
> %m1 = ...
> %m2 = ...
> %m3 = ...
> # Do the combining where legal
> %m4 = and %m1, %m2
> %r6 = add %r2, %r4, %m4
> %r7 = neg %r6, %m4
> %r7 = select %r7, %r6, %m4 # Define all elements for final merge
> # Take care of the rest
> %m6 = xor %m1, %m4
> %m7 = xor %m2, %m4
> %m8 = xor %m3, %m4
> %r1 = neg %r2, %m6
> %r1 = select %r1, %r2, %m6 # Define all elements for add
> %r3 = neg %r4, %m7
> %r3 = select %r3, %r4, %m7 # Define all elements for add
> %r5 = add %r1, %r3, %m8
> %r5 = select %r5, %r1, %m7 # Define all elements for final merge
> # Do the final merge
> %r5 = select %r7, %r5, %m4
>
> But this is obviously a loss.  We may be able to get away with
> eliminating some merges (I'm being conservative about undefined
> elements here) but it would still be a loss.
>
> applymask:
>
> %m1 = ...
> %m2 = ...
> %m3 = ...
> %r6 = applymask %r2, %m1
> %r7 = applymask %r4, %m2
> %r1 = neg %r6
> %r3 = neg %r7
> %r8 = applymask %r1, %m3
> %r9 = applymask %r3, %m3
> %r5 = add %r8, %r9
>
> This is really no different than the mask-per-operation case.
> Instcombine still has to consider masks when evaluating the legality and
> profitability of combining.  Here instcombine would have to walk back
> through the def-use lists on the operands of the neg and add to find if
> they are masked and then determine the mask value (or mask register name
> for simpler equality testing).
>
> Again, doing the combining in this case would be detrimental.  My guess
> is that most combining produces worse code when operations are masked
> and the masks are not equal.
>
> Given these two examples, I see some benefit to applymask for
> transformations like CSE and other pattern-matching algorithms, where
> the renaming of registers that applymask requires converts previously
> matching patterns into patterns that don't match.  For something like
> instcombine or value numbering that is based on value flow and/or pure
> semantics, I don't see any benefit.

I agree about value numbering. I think instcombine is a little
different. But only with the help of a pretty non-trivial
assumption.

One of the cornerstone assumptions of the applymask approach is
that code like what's in your examples is relatively uncommon,
and that the common case is larger groups of instructions
sharing mask values. If this assumption isn't true, many of
the advantages of applymask no longer hold. I should encourage
people to question this assumption first, because pretty much
everything related to applymask depends on it :-).

But if it is true, then we can say that instcombine's job is
made easier by applymask. Instcombine primarily does very
localized analysis, so it can look at something like this:

   %r0 = neg %r888
   %r1 = neg %r999
   %r2 = add %r0, %r1

and without knowing anything about r888 or r999 and whether
or not there were applymasks involved in producing their
value, it can safely do the transformation.
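As a sketch of that locality (the tuple representation is invented, not
instcombine's real one): a peephole that matches an add of two negs can
fire on the snippet above using only the defining instructions of the
add's operands, without ever asking where %r888 or %r999 came from.

```python
# Toy peephole over a tuple IR (invented representation): rewrite
# add(neg X, neg Y) into neg(add X, Y) by inspecting only the
# defining instructions of the add's operands -- no mask queries.
def combine_neg_add(inst, defs):
    """inst: ('add', a, b); defs maps a name to its defining instruction."""
    op, a, b = inst
    da, db = defs.get(a), defs.get(b)
    if op == "add" and da and db and da[0] == "neg" and db[0] == "neg":
        return ("neg", ("add", da[1], db[1]))
    return inst

defs = {"r0": ("neg", "r888"), "r1": ("neg", "r999")}
print(combine_neg_add(("add", "r0", "r1"), defs))
# -> ('neg', ('add', 'r888', 'r999'))
```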

A key rule in my initial email is that the operands of any
operator must be under the same mask. The Verifier can enforce
this constraint. One problem, though, is that this places an
interesting burden on front-ends and vectorizers. I'll try to
write more about this soon.

There are things instcombine can do when the masks are different.
I think you mentioned earlier it could do clever tricks ANDing
masks and such. However, based on the assumption above, this
wouldn't be used very often, so a conservative implementation of
instcombine that doesn't do any of this is mostly good enough.
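A lane-by-lane model makes the "masks must agree" point concrete. This
is only a sketch with assumed merge semantics (inactive lanes pass an
old value through, mirroring the selects in the earlier example): the
combined form -(A + B) is provably equal to the uncombined one exactly
on the lanes where all three masks are set, which is why the legal
combine above had to run under the masks' intersection.

```python
# Lane-wise model of masked operations (assumption: inactive lanes
# pass through the "old" value, as the selects above arrange).
def masked(op, mask, srcs, old):
    return [op(*vals) if m else o
            for m, o, *vals in zip(mask, old, *srcs)]

A, B = [1, 2, 3, 4], [10, 20, 30, 40]
m1 = [1, 1, 0, 0]          # mask on neg A
m2 = [1, 0, 1, 0]          # mask on neg B
m3 = [1, 1, 1, 1]          # mask on the add

neg_a = masked(lambda a: -a, m1, [A], old=A)
neg_b = masked(lambda b: -b, m2, [B], old=B)
orig = masked(lambda a, b: a + b, m3, [neg_a, neg_b], old=neg_a)

# The combined form -(A + B), legal only under the intersection:
m4 = [a & b & c for a, b, c in zip(m1, m2, m3)]
comb = masked(lambda a, b: -(a + b), m4, [A, B], old=orig)

# Lanes agree exactly where every mask is set; elsewhere the
# combined form had to fall back on the uncombined result.
assert all(o == c for m, o, c in zip(m4, orig, comb) if m)
```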

>
>
> Of course, these are just two examples.  One could repeat the exercise
> for any transformation we're interested in.
>
> I hope this is helpful in moving us along in the discussion.  Feedback
> and comments are welcome!

Thanks for the examples! I think they have been helpful. I'll
try to add a few more here.

First, dead store elimination. It clearly needs to know if a
store is predicated. Mask operands would make this pretty
obvious. Applymask would make it less obvious, but still
doable, and would probably want memoization to be
efficient. This is a case where mask operands are favored.
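For example (a sketch, with an invented lane representation): with mask
operands visible on the store, deciding whether an earlier masked store
to the same address is dead reduces to a mask-coverage check.

```python
# Sketch: an earlier masked store to an address is dead only if a
# later store to the same address overwrites every lane the earlier
# one wrote -- so DSE must compare masks, not just addresses.
def is_dead(earlier_mask, later_mask):
    """True if every lane written by the earlier store is rewritten."""
    return all(l for e, l in zip(earlier_mask, later_mask) if e)

assert is_dead([1, 1, 0, 0], [1, 1, 1, 0])      # fully covered: dead
assert not is_dead([1, 1, 0, 0], [1, 0, 1, 1])  # lane 1 survives
```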

Here's one more example, or at least a skeleton of several
examples. A loop to be vectorized:

   for (...) {
     A
     if (...) {
       B
     } else {
       C
     }
     D
   }

Assuming there's a bunch of code in B and a bunch in C, then
we have four bunches of code and three mask conditions -
A and D are unmasked, B is masked, and C is masked with the
inverse mask value. This code could easily have more if
branches, nested loops, early exits, and so on, but the main
idea is that there are blocks of instructions grouped together
under the same mask, cross-block optimization opportunities
exist but are limited, and that this is assumed to be
basically representative.
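A scalar Python model of that shape (the actual computations standing
in for A, B, C, and D are invented placeholders) shows where the masks
come from and how a PHI in D becomes a vector merge on the mask:

```python
# Scalar model of the vectorized loop body, one list element per lane.
# A and D run under the full mask, B under the branch condition's
# mask, C under its inverse; D's PHI becomes a select on the mask.
def vector_loop(xs):
    cond = [x > 0 for x in xs]        # the "if (...)" test, per lane
    a = [x + 1 for x in xs]           # A: unmasked
    b = [v * 2 for v in a]            # B: only meaningful where cond
    c = [v - 2 for v in a]            # C: only meaningful where not cond
    merged = [bv if m else cv         # D's PHI as a vector merge
              for m, bv, cv in zip(cond, b, c)]
    return [v + 10 for v in merged]   # D: unmasked again

print(vector_loop([3, -1, 0, 5]))
# -> [18, 8, 9, 22]
```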

There are some interesting cross-block cases here.
If you can prove that something in B and/or C can be run
unmasked, it could be CSE'd/LICM'd/PRE'd/etc.
If D has a PHI merging a value from B and C, it might be
nice to do a vector merge. It's interesting to look at
how these kinds of cases work under both
mask operands and applymask.

Dan



