X86 Code Generator SIMD Design
------------------------------

1. Introduction

The advent of Intel's Advanced Vector Extensions (AVX) is an opportune time to examine the state of LLVM's x86 code generator and make improvements. Just as Intel took the AVX opportunity to unify the various x86 SIMD instruction sets, the LLVM community can take the opportunity to unify the x86 SIMD instruction specification, simplifying things for the casual developer and reducing errors.

In this article we describe a new design for LLVM's x86 SIMD instruction descriptions, explaining why a redesign is desirable and providing a step-by-step walk-through of the hows and whys of the implementation. This new design will replace the existing LLVM x86 MMX and SSE specifications.

2. Motivation

Why do a redesign? It's a reasonable question. In fact the original concept for the AVX specification assumed it would be a relatively simple matter to add new instruction class templates on top of the existing MMX and SSE infrastructure. However, we soon discovered that doing so would result in tremendous redundancy, with equally tremendous potential for things to get out of sync and lead to subtle bugs down the road. The maintenance burden would become impractical, and it stands to get dramatically worse if instructions such as those proposed for the Larrabee architecture make it into production processors.

2.1 Understanding the x86 Specification

Before going further, we need a baseline understanding of how the existing x86 MMX and SSE specifications work. We'll begin with the simplest example. Take the specification for SSE ADD:

  // SSE1
  defm ADD : basic_sse1_fp_binop_rm<0x58, "add", fadd,
                                    int_x86_sse_add_ss, 1>;

That looks easy enough. It's boiled down to the bare essentials of the instruction: an opcode, an asm mnemonic, an SDNode to match in a pattern, an intrinsic operator to match for "hard-coded" uses and a boolean indicating whether the operation is commutative (this becomes clear when examining the definition of basic_sse1_fp_binop_rm, which we will do shortly). The superclass name implies that it's an SSE1 floating-point instruction with two operands that is encoded with a register destination and either a register or memory source operand. Gleaning that from the name is mostly straightforward except for the "rm" part, which becomes clearer when we examine the definition of basic_sse1_fp_binop_rm:

  /// basic_sse1_fp_binop_rm - SSE1 binops come in both scalar and vector
  /// forms.
  ///
  /// In addition, we also have a special variant of the scalar form here to
  /// represent the associated intrinsic operation. This form is unlike the
  /// plain scalar form, in that it takes an entire vector (instead of a
  /// scalar) and leaves the top elements undefined.
  ///
  /// These three forms can each be reg+reg or reg+mem, so there are a total
  /// of six "instructions".
  ///
  let Constraints = "$src1 = $dst" in {
  multiclass basic_sse1_fp_binop_rm<bits<8> opc, string OpcodeStr,
                                    SDNode OpNode, Intrinsic F32Int,
                                    bit Commutable = 0> {
    // Scalar operation, reg+reg.
    def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst), (ins FR32:$src1, FR32:$src2),
                   !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                   [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
      let isCommutable = Commutable;
    }

    // Scalar operation, reg+mem.
    def SSrm : SSI<opc, MRMSrcMem, (outs FR32:$dst), (ins FR32:$src1, f32mem:$src2),
                   !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                   [(set FR32:$dst, (OpNode FR32:$src1, (load addr:$src2)))]>;

    // Vector operation, reg+reg.
    def PSrr : PSI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                   !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                   [(set VR128:$dst, (v4f32 (OpNode VR128:$src1, VR128:$src2)))]> {
      let isCommutable = Commutable;
    }

    // Vector operation, reg+mem.
    def PSrm : PSI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, f128mem:$src2),
                   !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                   [(set VR128:$dst, (OpNode VR128:$src1, (memopv4f32 addr:$src2)))]>;

    // Intrinsic operation, reg+reg.
    def SSrr_Int : SSI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                       !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                       [(set VR128:$dst, (F32Int VR128:$src1, VR128:$src2))]> {
      let isCommutable = Commutable;
    }

    // Intrinsic operation, reg+mem.
    def SSrm_Int : SSI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, ssmem:$src2),
                       !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                       [(set VR128:$dst, (F32Int VR128:$src1,
                                          sse_load_f32:$src2))]>;
  }
  }

That's a lot of code, but it is relatively simple to understand once one gets the hang of TableGen syntax.
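If the multiclass/defm machinery is new to you, the following self-contained snippet (hypothetical records, unrelated to the real X86 description) shows the core mechanics in isolation before we dissect the real thing:

  // Toy example of multiclass/defm expansion.
  class Inst<string asm> {
    string AsmString = asm;
  }

  multiclass binop_rm<string OpcodeStr> {
    // Each def's name is appended to the name of the enclosing defm.
    def rr : Inst<!strconcat(OpcodeStr, " $dst, $src")>;    // reg+reg form
    def rm : Inst<!strconcat(OpcodeStr, " $dst, [$src]")>;  // reg+mem form
  }

  defm ADD : binop_rm<"add">;  // expands to two records: ADDrr and ADDrm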
Back in the real specification, basic_sse1_fp_binop_rm takes the arguments we passed from the "ADD" defm above. The multiclass automatically generates a set of definitions, appending the name associated with each "def" to the name of the defm, in this case "ADD." So "defm ADD" results in the following:

  def ADDSSrr<...>;
  def ADDSSrm<...>;
  def ADDPSrr<...>;
  def ADDPSrm<...>;
  def ADDSSrr_Int<...>;
  def ADDSSrm_Int<...>;

Now it becomes a little clearer what that "rm" in the multiclass name means. It's a mnemonic to remember that the multiclass generates variants of the x86 instruction with a register destination and a register or memory source operand. The analogous "mr" mnemonic represents an instruction with either a register or memory destination and a register source. Multiclasses and defms are tremendously useful and eliminate a lot of tedious coding effort.

Let's look at one of the defs inside the multiclass:

  // Scalar operation, reg+reg.
  def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst), (ins FR32:$src1, FR32:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
    let isCommutable = Commutable;
  }

The SSrr def passes the opcode to the SSI class along with some more information. It tells SSI that the instruction uses "MRMSrcReg" encoding, meaning that the x86 ModR/M byte encodes a source register (the analogous MRMSrcMem form encodes a source memory address, with an optional SIB byte -- see your handy Intel x86 programmer's manual for further fun). This information is important for the JIT and anything else that creates object code.

Next are a couple of "dag" constructs. The first specifies the outputs of the instruction (one FR32 32-bit float scalar register in this case) while the second specifies the inputs (two 32-bit float scalar registers). The names "$dst," "$src1" and "$src2" identify the operands.

After that the def passes a generated asm string template to SSI. In this case it takes the asm mnemonic passed in ("add") and prepends it to a rather long string ("ss\t{$src2, $dst|$dst, $src2}"). The important thing to understand about this string is the pipe in the middle, which separates two alternative asm encodings. One can write x86 instructions in one of two ways: AT&T syntax, where the destination is the rightmost operand (like b + c -> a), or Intel syntax, where the destination is the leftmost operand (like a = b + c). TableGen knows to generate two separate AsmWriters and uses the pipe to delimit the alternative syntaxes. So in the end this fragment of code generates the following two asm strings:

  addss\t$src2, $dst   // AT&T
  addss\t$dst, $src2   // Intel

So what happened to $src1? Since most x86 instructions are two-address, there is an inherent constraint that the destination of the instruction must be the same as one of its sources, in this case a register. This is expressed in TableGen with the "let Constraints = "$src1 = $dst" in" construct immediately preceding the definition of basic_sse1_fp_binop_rm. TableGen will generate code to match an ADD only if some source operand can be matched to the destination operand. The "isCommutable" parameter tells TableGen whether it can commute the operands of the instruction to get things into the right form.
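The outer "let" block deserves a brief aside, since the SSE specification leans on it heavily: a "let ... in" scope pins a field's value for every record defined inside it. Here is a minimal sketch (hypothetical records again, outside the real target description):

  // Toy example of "let ... in" scoping.
  class Op {
    string Constraints = "";
    bit isCommutable = 0;
  }

  // Every def in this scope gets the two-address constraint.
  let Constraints = "$src1 = $dst" in {
    def MyAdd : Op { let isCommutable = 1; }  // per-record override
    def MySub : Op;                           // picks up only the constraint
  }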
The final argument to the SSI class is the key to the pattern-matching code generation. It's a dag construct describing what SelectionDAG structure can be implemented with an ADD instruction. For ADDSSrr, the specification is:

  [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]

It's actually a list of dags, but that has no consequence for the existing SSE specification, as the list always contains exactly one element.

This basically says that a DAG whose destination register operand gets its value from some operation (fadd, as passed from the defm) on two register operands can be matched to an ADDSSrr instruction, provided that $src1 is the same register as $dst. Actually, because isCommutable is set, it is sufficient for either $src1 or $src2 to be the same register as $dst.

These patterns are generally the most difficult things to get right in a TableGen specification. All of the other information, like opcodes and encoding specifications, can be read right out of the Intel instruction manuals. However, there are many different ways one might match an ADD or any other instruction, and care must be taken to either cover all the cases or manually canonicalize the DAG to a small number of patterns.

Diving down further, let's look at the SSI class:

  class SSI<bits<8> o, Format F, dag outs, dag ins, string asm,
            list<dag> pattern>
    : I<o, F, outs, ins, asm, pattern>, XS, Requires<[HasSSE1]>;

This is essentially a pass-through to the common x86 "I" class for instructions:

  class I<bits<8> o, Format f, dag outs, dag ins, string asm,
          list<dag> pattern>
    : X86Inst<o, f, NoImm, outs, ins, asm> {
    let Pattern = pattern;
    let CodeSize = 3;
  }

  class X86Inst<bits<8> opcod, Format f, ImmType i, dag outs, dag ins,
                string AsmStr>
    : Instruction {
    let Namespace = "X86";

    bits<8> Opcode = opcod;
    Format Form = f;
    bits<6> FormBits = Form.Value;
    ImmType ImmT = i;
    bits<3> ImmTypeBits = ImmT.Value;

    dag OutOperandList = outs;
    dag InOperandList = ins;
    string AsmString = AsmStr;

    //
    // Attributes specific to X86 instructions...
    //
    bit hasOpSizePrefix = 0;  // Does this inst have a 0x66 prefix?
    bit hasAdSizePrefix = 0;  // Does this inst have a 0x67 prefix?
    bits<4> Prefix = 0;       // Which prefix byte does this inst have?
    bit hasREX_WPrefix = 0;   // Does this inst require the REX.W prefix?
    FPFormat FPForm;          // What flavor of FP instruction is this?
    bits<3> FPFormBits = 0;
    bit hasLockPrefix = 0;    // Does this inst have a 0xF0 prefix?
    bits<2> SegOvrBits = 0;   // Segment override prefix.
  }

The X86Inst class is where we describe the nitty-gritty of binary encoding. The important things to pick out are the *Prefix members and the Form specifier. We've seen a value of Form before in MRMSrcReg. The *Prefix members record which instruction prefix bytes, together with the opcode, complete the instruction operation sequence.

For an SSE ADD instruction, the opcode is always 0x58. But does this mean a single-precision add, a double-precision add, a scalar add or a vector add? That's what the prefix bytes tell us. There are seven basic prefixes for SSE instructions, described by the following classes:

  // Prefix byte classes which are used to indicate to the ad-hoc machine
  // code emitter that various prefix bytes are required.
  class OpSize { bit hasOpSizePrefix = 1; }  // 0x66
  class REX_W  { bit hasREX_WPrefix = 1; }
  class TB     { bits<4> Prefix = 1; }   // Two-byte opcode: 0x0F
  class XD     { bits<4> Prefix = 11; }  // SSE double-precision: 0xF2 0x0F
  class XS     { bits<4> Prefix = 12; }  // SSE single-precision: 0xF3 0x0F
  class T8     { bits<4> Prefix = 13; }  // SSE 0x0F 0x38
  class TA     { bits<4> Prefix = 14; }  // SSE 0x0F 0x3A

SSE1 and SSE2 instructions have fairly regular encodings. All of those instructions carry the two-byte (TB) opcode escape 0x0F to distinguish them from other scalar instructions. For example, the POP instruction also has opcode 0x58 (consult the x86 Programmer's Manual opcode tables for the gory details), so the TB prefix distinguishes SSE ADD from POP.
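To see the prefix mixins at work, here is a stripped-down model (hypothetical records; the real classes above carry much more state). The key TableGen behavior is that a later superclass may override a field declared by an earlier one, which is exactly how XS, XD and friends set the encoding bits:

  // Toy model of the prefix mixin idiom.
  class MyEnc {
    bits<8> Opcode = 0;
    bits<4> Prefix = 0;  // 0 = none, 12 = 0xF3 0x0F, etc.
  }
  class MyXS { bits<4> Prefix = 12; }

  // An ADDSSrr-like record: opcode 0x58 plus the XS prefix
  // yields the operation encoding 0xF3 0x0F 0x58.
  def MyADDSS : MyEnc, MyXS {
    let Opcode = 0x58;
  }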
SSE1 instructions operate on single-precision data. The vector version ADDPS uses no additional prefixes, so its operation encoding is "0x0F 0x58." The scalar ADDSS needs another prefix to distinguish itself from the vector version, so it uses the 0xF3 "XS" (for XMM Single) prefix, giving it an operation encoding of "0xF3 0x0F 0x58."

SSE2 instructions operate on double-precision data. To indicate this, the floating-point SSE2 vector instructions use the 0x66 "OpSize" operand-size override prefix, yielding the operation encoding "0x66 0x0F 0x58" for ADDPD. The ADDSD scalar double-precision instruction uses the 0xF2 "XD" (XMM Double) prefix, making its operation encoding "0xF2 0x0F 0x58."

Later SSE instruction sets are less regular in their use of prefixes, and that needs to be accounted for in the instruction specifications. We will see that later on.

The SSI class adds the XS prefix to its subclasses, marking them as scalar single-precision instructions. The PSI class similarly adds the TB prefix to packed instructions. In addition, both SSI and PSI add a Requires<> predicate, HasSSE1, which tells TableGen to match the pattern only if the target processor implements SSE1 instructions.

2.2 Redundancy in the SSE Specification

Climbing back up the class hierarchy, if we compare the defs in the basic_sse1_fp_binop_rm multiclass we see a fair amount of redundancy:

  // Scalar operation, reg+reg.
  def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst), (ins FR32:$src1, FR32:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
    let isCommutable = Commutable;
  }

  // Scalar operation, reg+mem.
  def SSrm : SSI<opc, MRMSrcMem, (outs FR32:$dst), (ins FR32:$src1, f32mem:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, (load addr:$src2)))]>;

  // Vector operation, reg+reg.
  def PSrr : PSI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                 !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                 [(set VR128:$dst, (v4f32 (OpNode VR128:$src1, VR128:$src2)))]> {
    let isCommutable = Commutable;
  }

  // Vector operation, reg+mem.
  def PSrm : PSI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, f128mem:$src2),
                 !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                 [(set VR128:$dst, (OpNode VR128:$src1, (memopv4f32 addr:$src2)))]>;

Both "rr" defs pass MRMSrcReg as the ModR/M encoding; both "rm" defs pass MRMSrcMem. There is a direct mapping from the def name (SSrr, PSrr, etc.) to the asm string appended to the mnemonic passed from the defm (SS -> "ss", PS -> "ps" and so on). The operand dags have identical shapes across the rr and rm variants, and only the operand classes (FR32 vs. VR128, f32mem vs. f128mem) vary among the defs. Most importantly, the patterns for all four instructions look almost identical:

  [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]                 // SSrr
  [(set VR128:$dst, (v4f32 (OpNode VR128:$src1, VR128:$src2)))]      // PSrr
  [(set FR32:$dst, (OpNode FR32:$src1, (load addr:$src2)))]          // SSrm
  [(set VR128:$dst, (OpNode VR128:$src1, (memopv4f32 addr:$src2)))]  // PSrm

All the rr patterns have the basic structure:

  [(set <regclass>:$dst, (OpNode <regclass>:$src1, <regclass>:$src2))]

and all the rm patterns have the basic structure:

  [(set <regclass>:$dst, (OpNode <regclass>:$src1, (<loadop> addr:$src2)))]

where the identifiers between angle brackets indicate placeholders that stand in for a specific object depending on the type of the operation (scalar, vector, single-precision, etc.). Note that PSrr has an extra type specifier. That can be accounted for by generalizing the patterns:

  [(set <regclass>:$dst, (<type> (OpNode <regclass>:$src1, <regclass>:$src2)))]

There's no harm in specifying a type even if it could be inferred from the dag. In fact we sometimes want to do so as an error check.
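To preview where this is heading, here is a minimal sketch of a multiclass built around those generalized patterns. The names (fp_binop_rm, Suffix, RC, MemOp, LoadOp) are hypothetical, and the sketch glosses over the prefix mixins, Requires<> predicates and intrinsic forms; it is meant only to show that one template can stamp out both the scalar and vector variants:

  // Hypothetical sketch, not the actual redesign: one multiclass
  // parameterized over the placeholders identified above.
  let Constraints = "$src1 = $dst" in {
  multiclass fp_binop_rm<bits<8> opc, string OpcodeStr, string Suffix,
                         SDNode OpNode, ValueType vt, RegisterClass RC,
                         Operand MemOp, PatFrag LoadOp> {
    def rr : I<opc, MRMSrcReg, (outs RC:$dst), (ins RC:$src1, RC:$src2),
               !strconcat(OpcodeStr,
                          !strconcat(Suffix, "\t{$src2, $dst|$dst, $src2}")),
               [(set RC:$dst, (vt (OpNode RC:$src1, RC:$src2)))]>;
    def rm : I<opc, MRMSrcMem, (outs RC:$dst), (ins RC:$src1, MemOp:$src2),
               !strconcat(OpcodeStr,
                          !strconcat(Suffix, "\t{$src2, $dst|$dst, $src2}")),
               [(set RC:$dst, (vt (OpNode RC:$src1, (LoadOp addr:$src2))))]>;
  }
  }

  // Each variant then becomes one line of data rather than a hand-written def:
  defm ADDSS : fp_binop_rm<0x58, "add", "ss", fadd, f32, FR32, f32mem, loadf32>;
  defm ADDPS : fp_binop_rm<0x58, "add", "ps", fadd, v4f32, VR128, f128mem,
                           memopv4f32>;

With that idea in mind, the redundancy we are about to see in the SSE2 definitions should look eminently factorable.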
Now let's look at the SSE2 ADD:

  defm ADD : basic_sse2_fp_binop_rm<0x58, "add", fadd,
                                    int_x86_sse2_add_sd, 1>;

This looks distressingly similar to the SSE1 ADD defm. And in fact basic_sse2_fp_binop_rm also looks familiar:

  /// basic_sse2_fp_binop_rm - SSE2 binops come in both scalar and vector
  /// forms.
  ///
  /// In addition, we also have a special variant of the scalar form here to
  /// represent the associated intrinsic operation. This form is unlike the
  /// plain scalar form, in that it takes an entire vector (instead of a
  /// scalar) and leaves the top elements undefined.
  ///
  /// These three forms can each be reg+reg or reg+mem, so there are a total
  /// of six "instructions".
  ///
  let Constraints = "$src1 = $dst" in {
  multiclass basic_sse2_fp_binop_rm<bits<8> opc, string OpcodeStr,
                                    SDNode OpNode, Intrinsic F64Int,
                                    bit Commutable = 0> {
    // Scalar operation, reg+reg.
    def SDrr : SDI<opc, MRMSrcReg, (outs FR64:$dst), (ins FR64:$src1, FR64:$src2),
                   !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                   [(set FR64:$dst, (OpNode FR64:$src1, FR64:$src2))]> {
      let isCommutable = Commutable;
    }

    // Scalar operation, reg+mem.
    def SDrm : SDI<opc, MRMSrcMem, (outs FR64:$dst), (ins FR64:$src1, f64mem:$src2),
                   !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                   [(set FR64:$dst, (OpNode FR64:$src1, (load addr:$src2)))]>;

    // Vector operation, reg+reg.
    def PDrr : PDI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                   !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
                   [(set VR128:$dst, (v2f64 (OpNode VR128:$src1, VR128:$src2)))]> {
      let isCommutable = Commutable;
    }

    // Vector operation, reg+mem.
    def PDrm : PDI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, f128mem:$src2),
                   !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
                   [(set VR128:$dst, (OpNode VR128:$src1, (memopv2f64 addr:$src2)))]>;

    // Intrinsic operation, reg+reg.
    def SDrr_Int : SDI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                       !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                       [(set VR128:$dst, (F64Int VR128:$src1, VR128:$src2))]> {
      let isCommutable = Commutable;
    }

    // Intrinsic operation, reg+mem.
    def SDrm_Int : SDI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, sdmem:$src2),
                       !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                       [(set VR128:$dst, (F64Int VR128:$src1,
                                          sse_load_f64:$src2))]>;
  }
  }

There's very little difference here from basic_sse1_fp_binop_rm: some D's instead of S's, and the defs subclass SDI/PDI rather than SSI/PSI. SDI looks very much like SSI:

  class SDI<bits<8> o, Format F, dag outs, dag ins, string asm,
            list<dag> pattern>
    : I<o, F, outs, ins, asm, pattern>, XD, Requires<[HasSSE2]>;

The only differences are the XD prefix rather than XS and the ISA requirement (HasSSE2 rather than HasSSE1).

That's a lot of code to specify eight instructions that look very similar to each other: ADDSSrr, ADDSSrm, ADDPSrr, ADDPSrm, ADDSDrr, ADDSDrm, ADDPDrr and ADDPDrm. It's unlikely that anyone will go mucking around with the ADD instruction patterns in the near future, but consider a more complicated case like a vector shuffle/permute:

  let Constraints = "$src1 = $dst" in {
    let isConvertibleToThreeAddress = 1 in // Convert to pshufd
      def SHUFPSrri : PSIi8<0xC6, MRMSrcReg,
                            (outs VR128:$dst),
                            (ins VR128:$src1, VR128:$src2, i8imm:$src3),
                            "shufps\t{$src3, $src2, $dst|$dst, $src2, $src3}",
                            [(set VR128:$dst,
                              (v4f32 (shufp:$src3 VR128:$src1, VR128:$src2)))]>;
  }

  let Constraints = "$src1 = $dst" in {
    def SHUFPDrri : PDIi8<0xC6, MRMSrcReg,
                          (outs VR128:$dst),
                          (ins VR128:$src1, VR128:$src2, i8imm:$src3),
                          "shufpd\t{$src3, $src2, $dst|$dst, $src2, $src3}",
                          [(set VR128:$dst,
                            (v2f64 (shufp:$src3 VR128:$src1, VR128:$src2)))]>;
  }

Again, the patterns look very similar. "shufp" is a small pattern fragment that matches a vector_shuffle node (the SelectionDAG form of LLVM's shufflevector instruction) and checks whether the supplied shuffle mask is something expressible by the SHUFP[SD] instructions. Essentially, it makes sure that the lower elements of the result come from the $src1/$dst input and the upper elements come from the $src2 input. It looks like this:

  def shufp : PatFrag<(ops node:$lhs, node:$rhs),
                      (vector_shuffle node:$lhs, node:$rhs), [{
    return X86::isSHUFPMask(cast<ShuffleVectorSDNode>(N));
  }], SHUFFLE_get_shuf_imm>;

This essentially gets "inlined" into the patterns where it is used, yielding this for SHUFPS (plus the mask predicate and the mask-to-immediate transform):

  [(set VR128:$dst,
        (v4f32 (vector_shuffle:$src3 VR128:$src1, VR128:$src2)))]

So we could save some code replication by unifying the patterns as with ADD above. However, a more serious issue presents itself as we explore further. The following patterns appear later in the SSE specification (reordered here for pedagogy):

  // Special unary SHUFPSrri case.
  def : Pat<(v4f32 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPSrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

  // Special unary SHUFPDrri case.
  def : Pat<(v2i64 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPDrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

  // Special unary SHUFPDrri case.
  def : Pat<(v2f64 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPDrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

  // Special binary v4i32 shuffle cases with SHUFPS.
  def : Pat<(v4i32 (shufp:$src3 VR128:$src1, (v4i32 VR128:$src2))),
            (SHUFPSrri VR128:$src1, VR128:$src2,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;
  def : Pat<(v4i32 (shufp:$src3 VR128:$src1,
                    (bc_v4i32 (memopv2i64 addr:$src2)))),
            (SHUFPSrmi VR128:$src1, addr:$src2,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

  // Special binary v2i64 shuffle cases using SHUFPDrri.
  def : Pat<(v2i64 (shufp:$src3 VR128:$src1, VR128:$src2)),
            (SHUFPDrri VR128:$src1, VR128:$src2,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

  // vector_shuffle v1, v2 <4, 5, 2, 3> using SHUFPSrri (we prefer movsd,
  // but fall back to this for SSE1)
  def : Pat<(v4f32 (movlp:$src3 VR128:$src1, (v4f32 VR128:$src2))),
            (SHUFPSrri VR128:$src2, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

There's another useful TableGen concept here: the Pat<> pattern. Pat<> patterns are essentially a way to map additional source patterns onto an already-defined machine instruction. It's as if we had passed multiple patterns to match to PSIi8 and PDIi8. "pshufd" is a more specialized variant of shufp:

  def pshufd : PatFrag<(ops node:$lhs, node:$rhs),
                       (vector_shuffle node:$lhs, node:$rhs), [{
    return X86::isPSHUFDMask(cast<ShuffleVectorSDNode>(N));
  }], SHUFFLE_get_shuf_imm>;

X86::isPSHUFDMask simply checks for a different sequence of elements in the vector_shuffle mask. The first Pat<> above:

  // Special unary SHUFPSrri case.
  def : Pat<(v4f32 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPSrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

says that a vector_shuffle of a source vector and some undefined vector (i.e. anything at all) whose mask satisfies the X86::isPSHUFDMask predicate can be implemented with a SHUFPSrri instruction. "SHUFFLE_get_shuf_imm" is the TableGen way of invoking target-specific code to convert the vector_shuffle mask to the immediate control bits expected by the SHUFPS instruction.

If we examine the set of Pat<> patterns above, we notice some non-orthogonality. There are unary patterns for SHUFPS and SHUFPD operating on floating-point vectors:

  // Special unary SHUFPSrri case.
  def : Pat<(v4f32 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPSrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

  // Special unary SHUFPDrri case.
  def : Pat<(v2f64 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPDrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

But SHUFPD also has a unary pattern operating on integer vectors:

  // Special unary SHUFPDrri case.
  def : Pat<(v2i64 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPDrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

We could just as well add a similar v4i32 pattern for SHUFPS, but it doesn't currently exist. Why not? One can only guess. Perhaps someone saw the opportunity to optimize a shuffle of packed 64-bit integers but didn't have time to implement the same for packed 32-bit integers. Perhaps it was just an oversight. But in any event, the capabilities for generating 32-bit and 64-bit integer code diverge here.

One can see similar divergence in the binary Pat<> patterns: SHUFPS has an extra reg+mem pattern that includes a bitconvert (bc_v4i32 above), while the SHUFPD analogue is missing.
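For concreteness, the missing SHUFPD analogue would look something like this (hypothetical code written here for illustration; it does not exist in the tree):

  // Hypothetical missing pattern: a binary v2i64 shuffle whose second
  // operand comes from memory, mirroring the SHUFPSrmi case above.
  def : Pat<(v2i64 (shufp:$src3 VR128:$src1, (memopv2i64 addr:$src2))),
            (SHUFPDrmi VR128:$src1, addr:$src2,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;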
Similarly, this extra pattern for SHUFPS:

  // vector_shuffle v1, v2 <4, 5, 2, 3> using SHUFPSrri (we prefer movsd,
  // but fall back to this for SSE1)
  def : Pat<(v4f32 (movlp:$src3 VR128:$src1, (v4f32 VR128:$src2))),
            (SHUFPSrri VR128:$src2, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

could also have an analogue for SHUFPD, essentially providing a MOVLPD-like operation from register to register (MOVLPD only moves to or from memory).

So the maintenance headaches of the current scheme are already apparent. The pattern space is not orthogonal, and LLVM is likely missing opportunities to generate better instruction sequences. We would like to unify the patterns into a common framework that can automatically generate orthogonal sets of patterns covering all of the vector and scalar types available to x86 SIMD instructions.

The next post will start to dive into the design of the x86 SIMD reorganization.