[LLVMdev] RFC: AVX Pattern Specification [LONG]

David Greene dag at cray.com
Thu Apr 30 15:59:11 PDT 2009


Here's the big RFC.

As I've gone through and designed patterns for AVX, I quickly realized that the
existing SSE pattern specification, while functional, is less than ideal in 
terms of maintenance.  In particular, a number of nearly-identical patterns 
are specified all over for nearly-identical instructions.  For example:

let Constraints = "$src1 = $dst" in {
multiclass basic_sse1_fp_binop_rm<bits<8> opc, string OpcodeStr,
                                  SDNode OpNode, Intrinsic F32Int,
                                  bit Commutable = 0> {
  // Scalar operation, reg+reg.
  def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst), 
                                 (ins FR32:$src1, FR32:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
    let isCommutable = Commutable;
  }

  // Scalar operation, reg+mem.
  def SSrm : SSI<opc, MRMSrcMem, (outs FR32:$dst),
                                 (ins FR32:$src1, f32mem:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, (load addr:$src2)))]>;
                 
  // Vector operation, reg+reg.
  def PSrr : PSI<opc, MRMSrcReg, (outs VR128:$dst),
                                 (ins VR128:$src1, VR128:$src2),
               !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
               [(set VR128:$dst, (v4f32 (OpNode VR128:$src1, 
                                                VR128:$src2)))]> {
    let isCommutable = Commutable;
  }

  // Vector operation, reg+mem.
  def PSrm : PSI<opc, MRMSrcMem, (outs VR128:$dst),
                                 (ins VR128:$src1, f128mem:$src2),
                 !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
             [(set VR128:$dst, (OpNode VR128:$src1, 
                                       (memopv4f32 addr:$src2)))]>;
}
} // Constraints = "$src1 = $dst"

These are all essentially the same except that ModRM formats, types and 
register classes change.  For patterns that access memory there are special 
"memory access operators" like memopv4f32.  But the base pattern of dest = 
src1 op src2 is the same.
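
Just to make the commonality concrete, here is a minimal sketch (the class and
parameter names are invented for illustration, and the prefix, predicate and
"$src1 = $dst" pieces are left out) of how the SSrm and PSrm defs above could
collapse into one parameterized class:

// Illustrative only: a single reg+mem base class covering both SSrm and
// PSrm above.  The memory access PatFrag (load, memopv4f32, ...) is just
// another parameter.
class binop_rm_sketch<bits<8> opc, string AsmStr, RegisterClass RC,
                      Operand MemOp, PatFrag MemFrag, SDNode OpNode>
  : I<opc, MRMSrcMem, (outs RC:$dst), (ins RC:$src1, MemOp:$src2),
      AsmStr,
      [(set RC:$dst, (OpNode RC:$src1, (MemFrag addr:$src2)))]>;

Instantiating it with FR32/f32mem/load reproduces the SSrm pattern, and
VR128/f128mem/memopv4f32 reproduces the PSrm pattern.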

Worse yet:

let Constraints = "$src1 = $dst" in {
multiclass basic_sse2_fp_binop_rm<bits<8> opc, string OpcodeStr,
                                  SDNode OpNode, Intrinsic F64Int,
                                  bit Commutable = 0> {
  // Scalar operation, reg+reg.
  def SDrr : SDI<opc, MRMSrcReg, (outs FR64:$dst), 
                                 (ins FR64:$src1, FR64:$src2),
                 !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                 [(set FR64:$dst, (OpNode FR64:$src1, FR64:$src2))]> {
    let isCommutable = Commutable;
  }

  // Scalar operation, reg+mem.
  def SDrm : SDI<opc, MRMSrcMem, (outs FR64:$dst), 
                                 (ins FR64:$src1, f64mem:$src2),
                 !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                 [(set FR64:$dst, (OpNode FR64:$src1, (load addr:$src2)))]>;
                 
  // Vector operation, reg+reg.
  def PDrr : PDI<opc, MRMSrcReg, (outs VR128:$dst), 
                                 (ins VR128:$src1, VR128:$src2),
               !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
               [(set VR128:$dst, (v2f64 (OpNode VR128:$src1, 
                                                VR128:$src2)))]> {
    let isCommutable = Commutable;
  }

  // Vector operation, reg+mem.
  def PDrm : PDI<opc, MRMSrcMem, (outs VR128:$dst), 
                                 (ins VR128:$src1, f128mem:$src2),
                 !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
                 [(set VR128:$dst, (OpNode VR128:$src1, 
                                           (memopv2f64 addr:$src2)))]>;
}
} // Constraints = "$src1 = $dst"

This looks identical to basic_sse1_fp_binop_rm except that it's SD/PD instead
of SS/PS, and the types and register classes differ in a predictable way.

So we already have two levels of redundancy: one within a single multiclass
and another across multiclasses.

This gets even worse with more complicated patterns like converts.  
Essentially the same complex pattern gets duplicated for the variously-sized 
converts.  A bug fix in one place needs to be replicated everywhere, and it's
easy to miss one or two.  This is the very definition of a "maintenance
problem."

Moreover, the various SSE levels were implemented at different times and do 
things subtly differently.  For example:

SSE1 :

  def ANDNPSrr : PSI<0x55, MRMSrcReg,
                     (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                     "andnps\t{$src2, $dst|$dst, $src2}",
                     [(set VR128:$dst,
                       (v2i64 (and (xor VR128:$src1,
                                    (bc_v2i64 (v4i32 immAllOnesV))),
                               VR128:$src2)))]>;

SSE2 :

  def ANDNPDrr : PDI<0x55, MRMSrcReg,
                     (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                     "andnpd\t{$src2, $dst|$dst, $src2}",
                     [(set VR128:$dst,
                       (and (vnot (bc_v2i64 (v2f64 VR128:$src1))),
                        (bc_v2i64 (v2f64 VR128:$src2))))]>;

Note the use of xor vs. vnot, the different placement of the bc_* fragments,
and the use of type specifiers.  I wonder whether we even match both of these.

And naming is not consistent:

def Int_CVTSS2SIrr : SSI<0x2D, MRMSrcReg, (outs GR32:$dst), (ins VR128:$src),
def MOVUPSrm_Int : PSI<0x10, MRMSrcMem, (outs VR128:$dst), (ins f128mem:$src),

Furthermore, the current scheme ties patterns to prefix encodings and Requires 
predicates:

  // Scalar operation, reg+reg.
  def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst), 
                                 (ins FR32:$src1, FR32:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
    let isCommutable = Commutable;
  }

From X86InstrFormats.td:

class SSI<bits<8> o, Format F, dag outs, dag ins, string asm, 
          list<dag> pattern>
      : I<o, F, outs, ins, asm, pattern>, XS, Requires<[HasSSE1]>;

For AVX we would need a different set of format classes: while AVX could
reuse the existing XS class (the XS prefix is recoded as part of the VEX
prefix, so we still need the information XS provides), "Requires<[HasSSE1]>"
is certainly inappropriate.  Initially I started factoring things out to
separate XS and the other prefix classes from Requires<>, but that didn't
solve the pattern problems I mentioned above.
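
To make that factoring concrete, the idea is roughly the following (the class
name is invented):

// Hypothetical factoring: the XS prefix stays with the format class, and
// each instantiation attaches its own predicate.
class SSI_Base<bits<8> o, Format F, dag outs, dag ins, string asm,
               list<dag> pattern>
      : I<o, F, outs, ins, asm, pattern>, XS;

// An SSE1 def would then add Requires<[HasSSE1]>, while an AVX def would
// add a new AVX predicate (and the VEX encoding bits) instead.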

All of this complication gets multiplied with AVX, because AVX recodes all of
the legacy SSE instructions using VEX to provide three-address forms.  So if
we were to follow the existing scheme, we would duplicate *all* of
X86InstrSSE.td, edit the patterns to match the three-address forms, and then
add the 256-bit patterns on top of that, effectively duplicating
X86InstrSSE.td a second time.

This is not scalable.

So what I've done is a little experiment to see if I can unify all SSE and AVX 
SIMD instructions under one framework.  I'll leave MMX and 3dNow alone since 
they're oddballs and hardly anyone uses them.

Essentially I've created a set of base pattern classes that are very generic.  
These contain the basic asm string templates and dag patterns we want to 
match.  These classes are parameterized by things like register class, 
operand type, ModRM format and "memory access operation."  I've also created 
patterns that take a fully specified asm string and/or dag pattern to provide 
flexibility for "oddball" instructions.
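
A rough sketch of the flavor of these base classes (again with invented names,
and with the prefix/predicate mixins left to the enclosing definitions) might
be a reg+reg binary operation like this:

// Illustrative reg+reg counterpart to the reg+mem sketch earlier: register
// class, value type, asm string and commutability are all parameters.
class binop_rr_sketch<bits<8> opc, string AsmStr, RegisterClass RC,
                      ValueType VT, SDNode OpNode, bit Commutable>
  : I<opc, MRMSrcReg, (outs RC:$dst), (ins RC:$src1, RC:$src2),
      AsmStr,
      [(set RC:$dst, (VT (OpNode RC:$src1, RC:$src2)))]> {
  let isCommutable = Commutable;
}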

Multiclasses sit on top of the patterns and aggregate various legal 
combinations (e.g. SS, SD, PS, PD for basic arithmetic).  There's a set
of base multiclasses and a set of derived multiclasses that aggregate
things into legal sets.  For example, some SSE instructions are vector-only
while others have scalar and vector versions.  Some instructions use the
XS, XD, TB and OpSize/TB prefixes while others use the TA, T8 and OpSize
prefixes.
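
As a sketch of how a derived multiclass could aggregate the packed forms
(built on the invented helper classes from the sketches above, and showing
only the two-address SSE flavor; the real multiclasses would also cover the
scalar, intrinsic and AVX three-address variants):

let Constraints = "$src1 = $dst" in {
multiclass binop_ps_pd_sketch<bits<8> opc, string OpcodeStr,
                              SDNode OpNode, bit Commutable = 0> {
  // Predicates (HasSSE1/HasSSE2 or an AVX predicate) are omitted here;
  // they would be attached as discussed above.
  def PSrr : binop_rr_sketch<opc,
               !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
               VR128, v4f32, OpNode, Commutable>, TB;
  def PSrm : binop_rm_sketch<opc,
               !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
               VR128, f128mem, memopv4f32, OpNode>, TB;
  def PDrr : binop_rr_sketch<opc,
               !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
               VR128, v2f64, OpNode, Commutable>, TB, OpSize;
  def PDrm : binop_rm_sketch<opc,
               !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
               VR128, f128mem, memopv2f64, OpNode>, TB, OpSize;
}
}

A single defm per operation, like the ADD example below, would then stamp out
all of these defs at once.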

The point of all of this is to write patterns and asm strings *once* for each
kind of instruction (binary arithmetic, convert, shuffle, etc.) and then use 
multiclasses to generate all of the concrete patterns for SSE and AVX.

So for example, an ADD would be specified like this:

// Arithmetic ops with intrinsics and scalar equivalents
defm ADD : 
sse1_sse2_avx_binary_scalar_xs_xd_vector_tb_ostb_node_intrinsic_rm_rrm<
   0x58,   // Opcode
   "add",  // asm base opcode name
   fadd,   // SDNode name
   "add",  // Intrinsic base name (we pre-concat int_x86_sse*/avx and 
           // post-contact ps/pd, etc.)
   1       // Commutative
>;

Now, the multiclass name is rather unwieldy, I know.  That can be changed, so
don't worry too much about it.  I'm more concerned about the overall scheme
and whether it makes sense to you all.

I have a Perl script that auto-generates the necessary multiclass
combinations, as well as the needed base classes, depending on what's in the
top-level .td file.  For now, I've named that top-level file X86InstrSIMD.td.

The Perl script would only need to be run when X86InstrSIMD.td changes.  Thus
its use would be similar to how we use autoconf today: we only run autoconf /
automake when we update their input files, not as part of the build process.

Initially, X86InstrSIMD.td would define only AVX instructions so it would not
impact existing SSE clients.  My intent is that X86InstrSIMD.td essentially
become the canonical description of all SSE and AVX instructions and 
X86InstrSSE.td would go away completely.

Of course we would not transition away from X86InstrSSE.td until
X86InstrSIMD.td is proven to cover all current uses of SSE correctly.

The pros of the scheme:

* Unify all "important" x86 SIMD instructions into one framework and provide 
  consistency

* Specify patterns and asm strings *once* per instruction type / family
  rather than the current scheme of multiple patterns for essentially the
  same instruction

* Bugfixes / optimizations / new patterns instantly apply to all SSE levels 
  and AVX

The cons:

* Transition from X86InstrSSE.td

* A more complex class hierarchy

* A class-generating tool / indirection

Personally, I think the pros far outweigh the cons, but I realize that this
is a major change and there are probably cons I haven't considered (and pros
as well!).

So right now I'm looking for comments.  This is the direction I intend to go
because it's far easier in the long run in terms of maintenance and future
extension.

I'll post an example as soon as I have time to package it up and get approval 
on this end to release it.  As of now I have simple arithmetic operations
implemented, and the proposed scheme seems to work: we're generating simple
arithmetic instructions and having them correctly assembled by gas.

Thanks for your ideas and input.

                             -Dave



