X86 Code Generator SIMD Design
------------------------------

1. Introduction

The advent of Intel's Advanced Vector Extensions (AVX) is an opportune time to examine the state of LLVM's x86 code generator and make improvements. Just as Intel took the AVX opportunity to unify the various x86 SIMD instruction sets, the LLVM community can take the opportunity to unify the x86 SIMD instruction specification, simplifying things for the casual developer and reducing errors.

In this article we describe a new design for LLVM's x86 SIMD instruction descriptions, explaining why a redesign is desirable and providing a step-by-step walk-through of the hows and whys of the implementation. This new design will replace the existing LLVM x86 MMX and SSE specifications.

2. Motivation

Why do a redesign? It's a reasonable question. In fact the original concept for the AVX specification assumed it would be a relatively simple matter to add new instruction class templates on top of the existing MMX and SSE infrastructure. However, we soon discovered that doing so would result in tremendous redundancy, with equally tremendous potential for things to get out of sync and lead to subtle bugs down the road. The maintenance burden would become impractical, and it stands to get dramatically worse if instructions such as those proposed for the Larrabee architecture make it into production processors.

2.1 Understanding the x86 Specification

Before going further, we need a baseline understanding of how the existing x86 MMX and SSE specifications work. We'll begin with the simplest example. Take the specification for SSE ADD:

  // SSE1
  defm ADD : basic_sse1_fp_binop_rm<0x58, "add", fadd,
                                    int_x86_sse_add_ss, 1>;

That looks easy enough. It's boiled down to the bare essentials of the instruction: an opcode, an asm mnemonic, an SDNode to match in a pattern, an intrinsic operator to match for "hard-coded" uses and a boolean indicating whether the operation is commutative (this becomes clear when examining the definition of basic_sse1_fp_binop_rm, which we will do shortly). The superclass name implies that it's an SSE1 floating-point instruction with two operands that is encoded with a register destination and either a register or memory source operand. Gleaning that from the name is mostly straightforward except for the "rm" part, which becomes clearer when we examine the definition of basic_sse1_fp_binop_rm:

  /// basic_sse1_fp_binop_rm - SSE1 binops come in both scalar and vector
  /// forms.
  ///
  /// In addition, we also have a special variant of the scalar form here to
  /// represent the associated intrinsic operation. This form is unlike the
  /// plain scalar form, in that it takes an entire vector (instead of a
  /// scalar) and leaves the top elements undefined.
  ///
  /// These three forms can each be reg+reg or reg+mem, so there are a total
  /// of six "instructions".
  ///
  let Constraints = "$src1 = $dst" in {
  multiclass basic_sse1_fp_binop_rm<bits<8> opc, string OpcodeStr,
                                    SDNode OpNode, Intrinsic F32Int,
                                    bit Commutable = 0> {
    // Scalar operation, reg+reg.
    def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst), (ins FR32:$src1, FR32:$src2),
                   !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                   [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
      let isCommutable = Commutable;
    }

    // Scalar operation, reg+mem.
    def SSrm : SSI<opc, MRMSrcMem, (outs FR32:$dst), (ins FR32:$src1, f32mem:$src2),
                   !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                   [(set FR32:$dst, (OpNode FR32:$src1, (load addr:$src2)))]>;

    // Vector operation, reg+reg.
    def PSrr : PSI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                   !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                   [(set VR128:$dst, (v4f32 (OpNode VR128:$src1, VR128:$src2)))]> {
      let isCommutable = Commutable;
    }

    // Vector operation, reg+mem.
    def PSrm : PSI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, f128mem:$src2),
                   !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                   [(set VR128:$dst, (OpNode VR128:$src1, (memopv4f32 addr:$src2)))]>;

    // Intrinsic operation, reg+reg.
    def SSrr_Int : SSI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                       !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                       [(set VR128:$dst, (F32Int VR128:$src1, VR128:$src2))]> {
      let isCommutable = Commutable;
    }

    // Intrinsic operation, reg+mem.
    def SSrm_Int : SSI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, ssmem:$src2),
                       !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                       [(set VR128:$dst, (F32Int VR128:$src1,
                                          sse_load_f32:$src2))]>;
  }
  }

That's a lot of code, but it is relatively simple to understand once one gets the hang of TableGen syntax.
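If the multiclass/defm machinery is new to you, the following self-contained snippet (hypothetical records, unrelated to the real X86 description) shows the core mechanics in isolation before we dissect the real thing:

  // Toy example of multiclass/defm expansion.
  class Inst<string asm> {
    string AsmString = asm;
  }

  multiclass binop_rm<string OpcodeStr> {
    // Each def's name is appended to the name of the enclosing defm.
    def rr : Inst<!strconcat(OpcodeStr, " $dst, $src")>;    // reg+reg form
    def rm : Inst<!strconcat(OpcodeStr, " $dst, [$src]")>;  // reg+mem form
  }

  defm ADD : binop_rm<"add">;  // expands to two records: ADDrr and ADDrm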
Back in the real specification, basic_sse1_fp_binop_rm takes the arguments we passed from the "ADD" defm above. The multiclass automatically generates a set of definitions, appending the name associated with each "def" to the name of the defm, in this case "ADD." So "defm ADD" results in the following:

  def ADDSSrr<...>;
  def ADDSSrm<...>;
  def ADDPSrr<...>;
  def ADDPSrm<...>;
  def ADDSSrr_Int<...>;
  def ADDSSrm_Int<...>;

Now it becomes a little clearer what that "rm" in the multiclass name means. It's a mnemonic to remember that the multiclass generates variants of the x86 instruction with a register destination and a register or memory source operand. The analogous "mr" mnemonic represents an instruction with either a register or memory destination and a register source. Multiclasses and defms are tremendously useful and eliminate a lot of tedious coding effort.

Let's look at one of the defs inside the multiclass:

  // Scalar operation, reg+reg.
  def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst), (ins FR32:$src1, FR32:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
    let isCommutable = Commutable;
  }

The SSrr def passes the opcode to the SSI class along with some more information. It tells SSI that the instruction uses "MRMSrcReg" encoding, meaning that the x86 ModR/M byte encodes a source register (the analogous MRMSrcMem form encodes a source memory address, with an optional SIB byte -- see your handy Intel x86 programmer's manual for further fun). This information is important for the JIT and anything else that creates object code.

Next are a couple of "dag" constructs. The first specifies the outputs of the instruction (one FR32 32-bit float scalar register in this case) while the second specifies the inputs (two 32-bit float scalar registers). The names "$dst," "$src1" and "$src2" identify the operands.

After that the def passes a generated asm string template to SSI. In this case it takes the asm mnemonic passed in ("add") and prepends it to a rather long string ("ss\t{$src2, $dst|$dst, $src2}"). The important thing to understand about this string is the pipe in the middle, which separates two alternative asm encodings. One can write x86 instructions in one of two ways: AT&T syntax, where the destination is the rightmost operand (like b + c -> a), or Intel syntax, where the destination is the leftmost operand (like a = b + c). TableGen knows to generate two separate AsmWriters and uses the pipe to delimit the alternative syntaxes. So in the end this fragment of code generates the following two asm strings:

  addss\t$src2, $dst   // AT&T
  addss\t$dst, $src2   // Intel

So what happened to $src1? Since most x86 instructions are two-address, there is an inherent constraint that the destination of the instruction must be the same as one of its sources, in this case a register. This is expressed in TableGen with the "let Constraints = "$src1 = $dst" in" construct immediately preceding the definition of basic_sse1_fp_binop_rm. TableGen will generate code to match an ADD only if some source operand can be matched to the destination operand. The "isCommutable" parameter tells TableGen whether it can commute the operands of the instruction to get things into the right form.
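The outer "let" block deserves a brief aside, since the SSE specification leans on it heavily: a "let ... in" scope pins a field's value for every record defined inside it. Here is a minimal sketch (hypothetical records again, outside the real target description):

  // Toy example of "let ... in" scoping.
  class Op {
    string Constraints = "";
    bit isCommutable = 0;
  }

  // Every def in this scope gets the two-address constraint.
  let Constraints = "$src1 = $dst" in {
    def MyAdd : Op { let isCommutable = 1; }  // per-record override
    def MySub : Op;                           // picks up only the constraint
  }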
The final argument to the SSI class is the key to the pattern-matching code generation. It's a dag construct describing what SelectionDAG structure can be implemented with an ADD instruction. For ADDSSrr, the specification is:

  [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]

It's actually a list of dags, but that has no consequence for the existing SSE specification, as the list always contains exactly one element.

This basically says that a DAG whose destination register operand gets its value from some operation (fadd, as passed from the defm) on two register operands can be matched to an ADDSSrr instruction, provided that $src1 is the same register as $dst. Actually, because isCommutable is set, it is sufficient for either $src1 or $src2 to be the same register as $dst.

These patterns are generally the most difficult things to get right in a TableGen specification. All of the other information, like opcodes and encoding specifications, can be read right out of the Intel instruction manuals. However, there are many different ways one might match an ADD or any other instruction, and care must be taken to either cover all the cases or manually canonicalize the DAG to a small number of patterns.

Diving down further, let's look at the SSI class:

  class SSI<bits<8> o, Format F, dag outs, dag ins, string asm,
            list<dag> pattern>
    : I<o, F, outs, ins, asm, pattern>, XS, Requires<[HasSSE1]>;

This is essentially a pass-through to the common x86 "I" class for instructions:

  class I<bits<8> o, Format f, dag outs, dag ins, string asm,
          list<dag> pattern>
    : X86Inst<o, f, NoImm, outs, ins, asm> {
    let Pattern = pattern;
    let CodeSize = 3;
  }

  class X86Inst<bits<8> opcod, Format f, ImmType i, dag outs, dag ins,
                string AsmStr>
    : Instruction {
    let Namespace = "X86";

    bits<8> Opcode = opcod;
    Format Form = f;
    bits<6> FormBits = Form.Value;
    ImmType ImmT = i;
    bits<3> ImmTypeBits = ImmT.Value;

    dag OutOperandList = outs;
    dag InOperandList = ins;
    string AsmString = AsmStr;

    //
    // Attributes specific to X86 instructions...
    //
    bit hasOpSizePrefix = 0;  // Does this inst have a 0x66 prefix?
    bit hasAdSizePrefix = 0;  // Does this inst have a 0x67 prefix?
    bits<4> Prefix = 0;       // Which prefix byte does this inst have?
    bit hasREX_WPrefix = 0;   // Does this inst require the REX.W prefix?
    FPFormat FPForm;          // What flavor of FP instruction is this?
    bits<3> FPFormBits = 0;
    bit hasLockPrefix = 0;    // Does this inst have a 0xF0 prefix?
    bits<2> SegOvrBits = 0;   // Segment override prefix.
  }

The X86Inst class is where we describe the nitty-gritty of binary encoding. The important things to pick out are the *Prefix members and the Form specifier. We've seen a value of Form before in MRMSrcReg. The *Prefix members record which instruction prefix bytes, together with the opcode, complete the instruction operation sequence.

For an SSE ADD instruction, the opcode is always 0x58. But does this mean a single-precision add, a double-precision add, a scalar add or a vector add? That's what the prefix bytes tell us. There are seven basic prefixes for SSE instructions, described by the following classes:

  // Prefix byte classes which are used to indicate to the ad-hoc machine
  // code emitter that various prefix bytes are required.
  class OpSize { bit hasOpSizePrefix = 1; }  // 0x66
  class REX_W  { bit hasREX_WPrefix = 1; }
  class TB     { bits<4> Prefix = 1; }   // Two-byte opcode: 0x0F
  class XD     { bits<4> Prefix = 11; }  // SSE double-precision: 0xF2 0x0F
  class XS     { bits<4> Prefix = 12; }  // SSE single-precision: 0xF3 0x0F
  class T8     { bits<4> Prefix = 13; }  // SSE 0x0F 0x38
  class TA     { bits<4> Prefix = 14; }  // SSE 0x0F 0x3A

SSE1 and SSE2 instructions have fairly regular encodings. All of those instructions carry the two-byte (TB) opcode escape 0x0F to distinguish them from other scalar instructions. For example, the POP instruction also has opcode 0x58 (consult the x86 Programmer's Manual opcode tables for the gory details), so the TB prefix distinguishes SSE ADD from POP.
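To see the prefix mixins at work, here is a stripped-down model (hypothetical records; the real classes above carry much more state). The key TableGen behavior is that a later superclass may override a field declared by an earlier one, which is exactly how XS, XD and friends set the encoding bits:

  // Toy model of the prefix mixin idiom.
  class MyEnc {
    bits<8> Opcode = 0;
    bits<4> Prefix = 0;  // 0 = none, 12 = 0xF3 0x0F, etc.
  }
  class MyXS { bits<4> Prefix = 12; }

  // An ADDSSrr-like record: opcode 0x58 plus the XS prefix
  // yields the operation encoding 0xF3 0x0F 0x58.
  def MyADDSS : MyEnc, MyXS {
    let Opcode = 0x58;
  }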
SSE1 instructions operate on single-precision data. The vector version ADDPS uses no additional prefixes, so its operation encoding is "0x0F 0x58." The scalar ADDSS needs another prefix to distinguish itself from the vector version, so it uses the 0xF3 "XS" (for XMM Single) prefix, giving it an operation encoding of "0xF3 0x0F 0x58."

SSE2 instructions operate on double-precision data. To indicate this, the floating-point SSE2 vector instructions use the 0x66 "OpSize" operand-size override prefix, yielding the operation encoding "0x66 0x0F 0x58" for ADDPD. The ADDSD scalar double-precision instruction uses the 0xF2 "XD" (XMM Double) prefix, making its operation encoding "0xF2 0x0F 0x58."

Later SSE instruction sets are less regular in their use of prefixes, and that needs to be accounted for in the instruction specifications. We will see that later on.

The SSI class adds the XS prefix to its subclasses, marking them as scalar single-precision instructions. The PSI class similarly adds the TB prefix to packed instructions. In addition, both SSI and PSI add a Requires<> predicate, HasSSE1, which tells TableGen to match the pattern only if the target processor implements SSE1 instructions.

2.2 Redundancy in the SSE Specification

Climbing back up the class hierarchy, if we compare the defs in the basic_sse1_fp_binop_rm multiclass we see a fair amount of redundancy:

  // Scalar operation, reg+reg.
  def SSrr : SSI<opc, MRMSrcReg, (outs FR32:$dst), (ins FR32:$src1, FR32:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]> {
    let isCommutable = Commutable;
  }

  // Scalar operation, reg+mem.
  def SSrm : SSI<opc, MRMSrcMem, (outs FR32:$dst), (ins FR32:$src1, f32mem:$src2),
                 !strconcat(OpcodeStr, "ss\t{$src2, $dst|$dst, $src2}"),
                 [(set FR32:$dst, (OpNode FR32:$src1, (load addr:$src2)))]>;

  // Vector operation, reg+reg.
  def PSrr : PSI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                 !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                 [(set VR128:$dst, (v4f32 (OpNode VR128:$src1, VR128:$src2)))]> {
    let isCommutable = Commutable;
  }

  // Vector operation, reg+mem.
  def PSrm : PSI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, f128mem:$src2),
                 !strconcat(OpcodeStr, "ps\t{$src2, $dst|$dst, $src2}"),
                 [(set VR128:$dst, (OpNode VR128:$src1, (memopv4f32 addr:$src2)))]>;

Both "rr" defs pass MRMSrcReg as the ModR/M encoding; both "rm" defs pass MRMSrcMem. There is a direct mapping from the def name (SSrr, PSrr, etc.) to the asm string appended to the mnemonic passed from the defm (SS -> "ss", PS -> "ps" and so on). The operand dags have identical shapes across the rr and rm variants, and only the operand classes (FR32 vs. VR128, f32mem vs. f128mem) vary among the defs. Most importantly, the patterns for all four instructions look almost identical:

  [(set FR32:$dst, (OpNode FR32:$src1, FR32:$src2))]                 // SSrr
  [(set VR128:$dst, (v4f32 (OpNode VR128:$src1, VR128:$src2)))]      // PSrr
  [(set FR32:$dst, (OpNode FR32:$src1, (load addr:$src2)))]          // SSrm
  [(set VR128:$dst, (OpNode VR128:$src1, (memopv4f32 addr:$src2)))]  // PSrm

All the rr patterns have the basic structure:

  [(set <regclass>:$dst, (OpNode <regclass>:$src1, <regclass>:$src2))]

and all the rm patterns have the basic structure:

  [(set <regclass>:$dst, (OpNode <regclass>:$src1, (<loadop> addr:$src2)))]

where the identifiers between angle brackets indicate placeholders that stand in for a specific object depending on the type of the operation (scalar, vector, single-precision, etc.). Note that PSrr has an extra type specifier. That can be accounted for by generalizing the patterns:

  [(set <regclass>:$dst, (<type> (OpNode <regclass>:$src1, <regclass>:$src2)))]

There's no harm in specifying a type even if it could be inferred from the dag. In fact we sometimes want to do so as an error check.
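To preview where this is heading, here is a minimal sketch of a multiclass built around those generalized patterns. The names (fp_binop_rm, Suffix, RC, MemOp, LoadOp) are hypothetical, and the sketch glosses over the prefix mixins, Requires<> predicates and intrinsic forms; it is meant only to show that one template can stamp out both the scalar and vector variants:

  // Hypothetical sketch, not the actual redesign: one multiclass
  // parameterized over the placeholders identified above.
  let Constraints = "$src1 = $dst" in {
  multiclass fp_binop_rm<bits<8> opc, string OpcodeStr, string Suffix,
                         SDNode OpNode, ValueType vt, RegisterClass RC,
                         Operand MemOp, PatFrag LoadOp> {
    def rr : I<opc, MRMSrcReg, (outs RC:$dst), (ins RC:$src1, RC:$src2),
               !strconcat(OpcodeStr,
                          !strconcat(Suffix, "\t{$src2, $dst|$dst, $src2}")),
               [(set RC:$dst, (vt (OpNode RC:$src1, RC:$src2)))]>;
    def rm : I<opc, MRMSrcMem, (outs RC:$dst), (ins RC:$src1, MemOp:$src2),
               !strconcat(OpcodeStr,
                          !strconcat(Suffix, "\t{$src2, $dst|$dst, $src2}")),
               [(set RC:$dst, (vt (OpNode RC:$src1, (LoadOp addr:$src2))))]>;
  }
  }

  // Each variant then becomes one line of data rather than a hand-written def:
  defm ADDSS : fp_binop_rm<0x58, "add", "ss", fadd, f32, FR32, f32mem, loadf32>;
  defm ADDPS : fp_binop_rm<0x58, "add", "ps", fadd, v4f32, VR128, f128mem,
                           memopv4f32>;

With that idea in mind, the redundancy we are about to see in the SSE2 definitions should look eminently factorable.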
Now let's look at the SSE2 ADD:

  defm ADD : basic_sse2_fp_binop_rm<0x58, "add", fadd,
                                    int_x86_sse2_add_sd, 1>;

This looks distressingly similar to the SSE1 ADD defm. And in fact basic_sse2_fp_binop_rm also looks familiar:

  /// basic_sse2_fp_binop_rm - SSE2 binops come in both scalar and vector
  /// forms.
  ///
  /// In addition, we also have a special variant of the scalar form here to
  /// represent the associated intrinsic operation. This form is unlike the
  /// plain scalar form, in that it takes an entire vector (instead of a
  /// scalar) and leaves the top elements undefined.
  ///
  /// These three forms can each be reg+reg or reg+mem, so there are a total
  /// of six "instructions".
  ///
  let Constraints = "$src1 = $dst" in {
  multiclass basic_sse2_fp_binop_rm<bits<8> opc, string OpcodeStr,
                                    SDNode OpNode, Intrinsic F64Int,
                                    bit Commutable = 0> {
    // Scalar operation, reg+reg.
    def SDrr : SDI<opc, MRMSrcReg, (outs FR64:$dst), (ins FR64:$src1, FR64:$src2),
                   !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                   [(set FR64:$dst, (OpNode FR64:$src1, FR64:$src2))]> {
      let isCommutable = Commutable;
    }

    // Scalar operation, reg+mem.
    def SDrm : SDI<opc, MRMSrcMem, (outs FR64:$dst), (ins FR64:$src1, f64mem:$src2),
                   !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                   [(set FR64:$dst, (OpNode FR64:$src1, (load addr:$src2)))]>;

    // Vector operation, reg+reg.
    def PDrr : PDI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                   !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
                   [(set VR128:$dst, (v2f64 (OpNode VR128:$src1, VR128:$src2)))]> {
      let isCommutable = Commutable;
    }

    // Vector operation, reg+mem.
    def PDrm : PDI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, f128mem:$src2),
                   !strconcat(OpcodeStr, "pd\t{$src2, $dst|$dst, $src2}"),
                   [(set VR128:$dst, (OpNode VR128:$src1, (memopv2f64 addr:$src2)))]>;

    // Intrinsic operation, reg+reg.
    def SDrr_Int : SDI<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
                       !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                       [(set VR128:$dst, (F64Int VR128:$src1, VR128:$src2))]> {
      let isCommutable = Commutable;
    }

    // Intrinsic operation, reg+mem.
    def SDrm_Int : SDI<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, sdmem:$src2),
                       !strconcat(OpcodeStr, "sd\t{$src2, $dst|$dst, $src2}"),
                       [(set VR128:$dst, (F64Int VR128:$src1,
                                          sse_load_f64:$src2))]>;
  }
  }

There's very little difference here from basic_sse1_fp_binop_rm: some D's instead of S's, and the defs subclass SDI/PDI rather than SSI/PSI. SDI looks very much like SSI:

  class SDI<bits<8> o, Format F, dag outs, dag ins, string asm,
            list<dag> pattern>
    : I<o, F, outs, ins, asm, pattern>, XD, Requires<[HasSSE2]>;

The only differences are the XD prefix rather than XS and the ISA requirement (HasSSE2 rather than HasSSE1).

That's a lot of code to specify eight instructions that look very similar to each other: ADDSSrr, ADDSSrm, ADDPSrr, ADDPSrm, ADDSDrr, ADDSDrm, ADDPDrr and ADDPDrm. It's unlikely that anyone will go mucking around with the ADD instruction patterns in the near future, but consider a more complicated case like a vector shuffle/permute:

  let Constraints = "$src1 = $dst" in {
    let isConvertibleToThreeAddress = 1 in // Convert to pshufd
      def SHUFPSrri : PSIi8<0xC6, MRMSrcReg,
                            (outs VR128:$dst),
                            (ins VR128:$src1, VR128:$src2, i8imm:$src3),
                            "shufps\t{$src3, $src2, $dst|$dst, $src2, $src3}",
                            [(set VR128:$dst,
                              (v4f32 (shufp:$src3 VR128:$src1, VR128:$src2)))]>;
  }

  let Constraints = "$src1 = $dst" in {
    def SHUFPDrri : PDIi8<0xC6, MRMSrcReg,
                          (outs VR128:$dst),
                          (ins VR128:$src1, VR128:$src2, i8imm:$src3),
                          "shufpd\t{$src3, $src2, $dst|$dst, $src2, $src3}",
                          [(set VR128:$dst,
                            (v2f64 (shufp:$src3 VR128:$src1, VR128:$src2)))]>;
  }

Again, the patterns look very similar. "shufp" is a small pattern fragment that matches a vector_shuffle node (the SelectionDAG form of LLVM's shufflevector instruction) and checks whether the supplied shuffle mask is something expressible by the SHUFP[SD] instructions. Essentially, it makes sure that the lower elements of the result come from the $src1/$dst input and the upper elements come from the $src2 input. It looks like this:

  def shufp : PatFrag<(ops node:$lhs, node:$rhs),
                      (vector_shuffle node:$lhs, node:$rhs), [{
    return X86::isSHUFPMask(cast<ShuffleVectorSDNode>(N));
  }], SHUFFLE_get_shuf_imm>;

This essentially gets "inlined" into the patterns where it is used, yielding this for SHUFPS (plus the mask predicate and the mask-to-immediate transform):

  [(set VR128:$dst,
        (v4f32 (vector_shuffle:$src3 VR128:$src1, VR128:$src2)))]

So we could save some code replication by unifying the patterns as with ADD above. However, a more serious issue presents itself as we explore further. The following patterns appear later in the SSE specification (reordered here for pedagogy):

  // Special unary SHUFPSrri case.
  def : Pat<(v4f32 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPSrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

  // Special unary SHUFPDrri case.
  def : Pat<(v2i64 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPDrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

  // Special unary SHUFPDrri case.
  def : Pat<(v2f64 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPDrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

  // Special binary v4i32 shuffle cases with SHUFPS.
  def : Pat<(v4i32 (shufp:$src3 VR128:$src1, (v4i32 VR128:$src2))),
            (SHUFPSrri VR128:$src1, VR128:$src2,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;
  def : Pat<(v4i32 (shufp:$src3 VR128:$src1,
                    (bc_v4i32 (memopv2i64 addr:$src2)))),
            (SHUFPSrmi VR128:$src1, addr:$src2,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

  // Special binary v2i64 shuffle cases using SHUFPDrri.
  def : Pat<(v2i64 (shufp:$src3 VR128:$src1, VR128:$src2)),
            (SHUFPDrri VR128:$src1, VR128:$src2,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

  // vector_shuffle v1, v2 <4, 5, 2, 3> using SHUFPSrri (we prefer movsd,
  // but fall back to this for SSE1)
  def : Pat<(v4f32 (movlp:$src3 VR128:$src1, (v4f32 VR128:$src2))),
            (SHUFPSrri VR128:$src2, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

There's another useful TableGen concept here: the Pat<> pattern. Pat<> patterns are essentially a way to map additional source patterns onto an already-defined machine instruction. It's as if we had passed multiple patterns to match to PSIi8 and PDIi8. "pshufd" is a more specialized variant of shufp:

  def pshufd : PatFrag<(ops node:$lhs, node:$rhs),
                       (vector_shuffle node:$lhs, node:$rhs), [{
    return X86::isPSHUFDMask(cast<ShuffleVectorSDNode>(N));
  }], SHUFFLE_get_shuf_imm>;

X86::isPSHUFDMask simply checks for a different sequence of elements in the vector_shuffle mask. The first Pat<> above:

  // Special unary SHUFPSrri case.
  def : Pat<(v4f32 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPSrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

says that a vector_shuffle of a source vector and some undefined vector (i.e. anything at all) whose mask satisfies the X86::isPSHUFDMask predicate can be implemented with a SHUFPSrri instruction. "SHUFFLE_get_shuf_imm" is the TableGen way of invoking target-specific code to convert the vector_shuffle mask to the immediate control bits expected by the SHUFPS instruction.

If we examine the set of Pat<> patterns above, we notice some non-orthogonality. There are unary patterns for SHUFPS and SHUFPD operating on floating-point vectors:

  // Special unary SHUFPSrri case.
  def : Pat<(v4f32 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPSrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

  // Special unary SHUFPDrri case.
  def : Pat<(v2f64 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPDrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

But SHUFPD also has a unary pattern operating on integer vectors:

  // Special unary SHUFPDrri case.
  def : Pat<(v2i64 (pshufd:$src3 VR128:$src1, (undef))),
            (SHUFPDrri VR128:$src1, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;

We could just as well add a similar v4i32 pattern for SHUFPS, but it doesn't currently exist. Why not? One can only guess. Perhaps someone saw the opportunity to optimize a shuffle of packed 64-bit integers but didn't have time to implement the same for packed 32-bit integers. Perhaps it was just an oversight. But in any event, the capabilities for generating 32-bit and 64-bit integer code diverge here.

One can see similar divergence in the binary Pat<> patterns: SHUFPS has an extra reg+mem pattern that includes a bitconvert (bc_v4i32 above), while the SHUFPD analogue is missing.
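For concreteness, the missing SHUFPD analogue would look something like this (hypothetical code written here for illustration; it does not exist in the tree):

  // Hypothetical missing pattern: a binary v2i64 shuffle whose second
  // operand comes from memory, mirroring the SHUFPSrmi case above.
  def : Pat<(v2i64 (shufp:$src3 VR128:$src1, (memopv2i64 addr:$src2))),
            (SHUFPDrmi VR128:$src1, addr:$src2,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE2]>;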
Similarly, this extra pattern for SHUFPS:

  // vector_shuffle v1, v2 <4, 5, 2, 3> using SHUFPSrri (we prefer movsd,
  // but fall back to this for SSE1)
  def : Pat<(v4f32 (movlp:$src3 VR128:$src1, (v4f32 VR128:$src2))),
            (SHUFPSrri VR128:$src2, VR128:$src1,
             (SHUFFLE_get_shuf_imm VR128:$src3))>,
        Requires<[HasSSE1]>;

could also have an analogue for SHUFPD, essentially providing a MOVLPD-like operation from register to register (MOVLPD only moves to or from memory).

So the maintenance headaches of the current scheme are already apparent. The pattern space is not orthogonal, and LLVM is likely missing opportunities to generate better instruction sequences. We would like to unify the patterns into a common framework that can automatically generate orthogonal sets of patterns covering all of the vector and scalar types available to x86 SIMD instructions.

The next post will start to dive into the design of the x86 SIMD reorganization.