[llvm] [LLVM][MDL] First integration of MDL with LLVM (PR #78002)

Reid Tatge via llvm-commits llvm-commits at lists.llvm.org
Wed Dec 11 16:42:35 PST 2024


================
@@ -0,0 +1,2483 @@
+
+# MPACT Microarchitecture Description Language
+
+Reid Tatge          [tatge at google.com](mailto:tatge at google.com)
+
+
+## **Goals for a Machine Description Language**
+
+Modern processors are complex: multiple execution pipelines, dynamic dispatch, out-of-order execution, register renaming, forwarding networks, and (often) undocumented micro-operations. Instruction behaviors, including micro-operations, often can’t be _statically_ modeled in an accurate way, but only _statistically_ modeled. In these cases, the compiler’s model of a microarchitecture (Schedules and Itineraries in LLVM) is effectively closer to a heuristic than a formal model. And this works quite well for general-purpose microprocessors.
+
+However, modern accelerators have different and/or additional dimensions of complexity: VLIW instruction issue, unprotected pipelines, tensor/vector ALUs, software-managed memory hierarchies. And it's more critical that compilers can precisely model the details of that complexity. Currently, LLVM’s Schedules and Itineraries aren’t adequate for directly modeling many accelerator architectural features.
+
+So we have several goals:
+
+1. We want a first-class, purpose-built, intuitive language that captures all the scheduling and latency details of the architecture - much like Schedules and Itineraries - that works well for all current targets, but also for a large class of accelerator architectures.
+2. The complexity of the specification should scale with the complexity of the hardware. 
+3. The description should be succinct, avoiding duplicated information, while reflecting the way things are defined in a hardware architecture specification.
+4. We want to generate artifacts that can be used in a machine-independent way for back-end optimization, register allocation, instruction scheduling, etc. - anything that depends on the behavior and constraints of instructions.
+5. We want to support a much larger class of architectures in one uniform manner.
+
+For this document (and language), the term “instructions” refers to the documented instruction set of the machine, as represented by LLVM instruction descriptions, rather than the undocumented micro-operations used by many modern microprocessors.
+
+The process of compiling a processor’s machine description creates several primary artifacts:
+
+*   For each target instruction (described in td files), we create an object that describes the detailed behaviors of the instruction in any legal context (for example, on any functional unit, on any processor).
+*   A set of methods with machine independent APIs that leverage the information associated with instructions to inform and guide back-end optimization passes.  
+
+The details of the artifacts are described later in this document.
+
+_Note: A full language grammar description is provided in an appendix.  Snippets of grammar throughout the document only provide the pertinent section of the grammar; see Appendix A for the full grammar._
+
+The proposed language can be thought of as an _optional extension to the LLVM machine description_. For most upstream architectures, the new language offers minimal benefit other than a much more succinct way to specify the architecture than Schedules and Itineraries.  But for accelerator-class architectures, it provides a level of detail and capability not available in the existing tablegen approaches.
+
+### **Background**
+
+Processor families evolve over time. They accrete new instructions, and pipelines change - often in subtle ways - as they accumulate more functional units and registers; encoding rules change; issue rules change. Understanding, encoding, and using all of this information - over time, for many subtargets - can be daunting.  When the description language isn’t sufficient to model the architecture, the back-end modeling evolves towards heuristics, leading to performance issues or bugs in the compiler. And it certainly ends with large amounts of target-specific code to handle “special cases”.
+
+LLVM uses the [TableGen](https://llvm.org/docs/TableGen/index.html) language to describe a processor, and this is quite sufficient for handling most general purpose architectures - there are 20+ processor families currently upstreamed in LLVM! In fact, it is very good at modeling instruction definitions, register classes, and calling conventions.  However, there are “features” of modern accelerator micro-architectures which are difficult or impossible to model in tablegen.
+
+We would like to easily handle:
+
+*   Complex pipeline behaviors
+    *   An instruction may have different latencies, resource usage, and/or register constraints on different functional units or for different operand values.
+    *   An instruction may read source registers more than once (in different pipeline phases).
+    *   Pipeline structure, depth, hazards, scoreboarding, and protection may differ between family members.
+*   Functional units
+    *   Managing functional unit behavior differences across subtargets of a family.
+    *   Imposing different register constraints on instructions (local register files, for example).
+    *   Sharing execution resources (such as register ports) with other functional units.
+    *   Functional unit clusters with separate execution pipelines.
+*   VLIW architectures
+    *   Issue rules can get extremely complex, and can depend on encoding, operand features, and pipeline behavior of candidate instructions.
+
+More generally, we’d like the language to:
+
+*   Support all members of a processor family
+*   Describe CPU features, parameterized by subtarget
+    *   Functional units
+    *   Pipeline structure and behaviors
+
+Since our emphasis is on easily supporting accelerators and VLIW processors, in addition to supporting all existing targets, much of this is overkill for most upstreamed CPUs.  CPUs typically have much simpler descriptions, and don’t require much of the capability of our machine description language.  Incidentally, MDL descriptions of these targets (generated automatically from the tablegen Schedules and Itineraries) are typically much more concise than the original tablegen descriptions.
+
+### **Approach - “Subunits” and Instruction Behaviors**
+
+We developed a DSL that allows us to describe an arbitrary processor microarchitecture in terms that reflect what is typically documented in the hardware specification. The MDL compiler creates a database that provides microarchitecture behavior information that can _automatically_ inform critical back-end compiler passes, such as instruction scheduling and register allocation, in a machine-independent way. 
+
+It’s important to note the difference between an instruction definition, as described in LLVM, and an instruction instance.  Generally, instructions defined in LLVM share the same behaviors across all instances of that instruction in a single subtarget. Exceptions to this require non-trivial code in the back-end to model variant behavior.  In VLIW and accelerator architectures, each generated instance of an instruction can have different behaviors, depending on how it's issued, its operand values, the functional unit it runs on, and the subtarget. So we provide a way to model those differences in reasonable ways.
+
+The MDL introduces the concept of a “subunit” to abstractly represent a class of instructions with the same behaviors. Subunit instances concretely connect instructions to descriptions of their behaviors, _and_ to the functional units that they can be issued on. A subunit is vaguely analogous to collections of SchedRead and SchedWrite resources. 
+
+Naively, we could create unique subunits for each behavior for each instruction, the set of which would enumerate the cross-product of the instruction’s behaviors on every subtarget, functional unit, and issue slot. But subunits can be specialized by subtarget, functional unit, and each instruction definition, so a single subunit definition can properly describe behaviors for sets of instructions in many different contexts.
+
+A key aspect of this language design is that we can explicitly represent the potentially polymorphic behavior of each generated instance of any instruction, on any functional unit, on any subtarget.  The representation also captures the fact that this information can vary between each of an instruction’s instances.
+
+We define a subunit as an object that defines the _behavior sets_ of an instruction instance in all legal contexts (functional units, issue slots), for each subtarget.  In particular, we want to know:
+
+*   What resources are shared or reserved, in what pipeline phases.
+    *   Encoding resources
+    *   Issue slot(s) used
+    *   Functional unit resources
+    *   Shared/private busses, register ports, resources, or pooled resources
+*   What registers are read and written, and in which pipeline phases (i.e., the instruction’s “latencies”).
+*   What additional register constraints a functional unit instance imposes on an instruction’s registers.
+
+The critical artifact generated by the MDL compiler is a set of instruction behaviors for each instruction definition.  For each subtarget, for each instruction, we generate a list of every possible behavior of that instruction on that CPU.  While this sounds daunting, in practice it's rare to have more than a few behaviors for an instruction, and most instruction definitions share their behaviors with many other instructions, across subtargets.
+
+## **Overview of a Processor Family Description**
+
+This document generally describes the language in a bottom-up order - details first.  But let's start with a brief top-down overview of what a processor family description looks like, without going into details about each part.
+
+A minimal processor family description has the following components:
+
+*   A set of CPU definitions - one for each subtarget.
+*   A set of functional unit template definitions.
+*   A set of subunit template definitions.
+*   A set of latency template definitions.
+
+A CPU definition specifies a set of functional unit instances that define the processor, as well as pipeline descriptions, issue slot resources, and binding of functional units to issue slots.  Each functional unit instance can be parameterized and specialized.
+
+A functional unit template specifies a set of subunit instances implemented by an instance of the functional unit.  It can be parameterized and specialized for each instance in different CPUs.
+
+A subunit template abstractly defines a set of related operations that have similar behaviors. They specify these behaviors with a set of “latency” instances.  They can also be parameterized and specialized for each instance in different functional unit templates.  Subunits tie instruction definitions both to the functional units on which they can execute and to the instruction behaviors described in latency templates.
+
+A latency template defines the pipeline behavior of a set of instructions.  It can be parameterized and specialized for each instance in a subunit instance.  It is also specialized for each instruction that is tied to it (through a subunit).  A latency rule, at a minimum, specifies when each operand is read and written in the execution pipeline.
+
+Here’s a very simple example of a trivial CPU, with three functional units, two issue slots, and a four-deep pipeline:
+
+```
+    cpu myCpu {
+        phases cpu { E1, E2, E3, E4 };
+        issue slot1, slot2;
+        func_unit FU_ALU my_alu1();     // an instance of FU_ALU
+        func_unit FU_ALU my_alu2();     // an instance of FU_ALU
+        func_unit FU_LOAD my_load();    // an instance of FU_LOAD
+    }
+
+    func_unit FU_ALU() {                // template definition for FU_ALU
+        subunit ALU();                  // an instance of subunit ALU
+    }
+    func_unit FU_LOAD() {               // template definition for FU_LOAD
+        subunit LOAD();                 // an instance of subunit LOAD
+    }
+
+    subunit ALU() {                     // template definition for ALU
+        latency LALU();                 // an instance of latency LALU
+    }
+    subunit LOAD() {                    // template definition for LOAD
+        latency LLOAD();                // an instance of latency LLOAD
+    }
+
+    latency LALU() {                    // template definition for LALU
+        def(E2, $dst);  use(E1, $src1);  use(E1, $src2);
+    }
+    latency LLOAD() {                   // template definition for LLOAD
+        def(E4, $dst);  use(E1, $addr);
+    }
+```
+
+A more complete description of each of these components is provided in the section “Defining a Processor Family”.
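+
+To illustrate how instruction definitions tie into this description (using the instruction syntax shown later in “Defining Instructions”), here is a hedged sketch. In practice these definitions are scraped from the tablegen output; the instruction names, the GPR register class, and the operand names below are hypothetical, chosen to line up with the $dst/$src1/$src2/$addr references in the latency templates above:
+
+```
+    // Hypothetical instructions bound to the ALU and LOAD subunits above.
+    // Each instruction can issue on any functional unit that instantiates
+    // its subunit, and picks up that subunit's latency behavior.
+    instruction ADD(GPR dst(O), GPR src1(I), GPR src2(I)) {
+        subunit(ALU);
+    }
+    instruction LDW(GPR dst(O), GPR addr(I)) {
+        subunit(LOAD);
+    }
+```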
+
+## **Defining an ISA**
+
+We need to map a microarchitecture model back to LLVM instruction, operand, and register definitions.  So, the MDL contains constructs for defining instructions, operands, registers, and register classes.  
+
+When writing a target machine description, it’s not necessary to write descriptions for instructions, operands, and registers - we scrape all of this information about the CPU ISA from the tablegen output as part of the build process, and produce an MDL file which contains these definitions. The machine description compiler uses these definitions to tie architectural information back to LLVM instructions, operands, and register classes.
+
+We will describe these language features here, primarily for completeness.
+
+### **Defining Instructions**
+
+Instruction definitions are scraped from tablegen files, and provide the following information to the MDL compiler for each instruction:
+
+*   The instruction’s name (as defined in the td files)
+*   Its operands, with the operand type and name provided in the order they are declared, and indicating whether each is an input or output of the instruction.
+*   A set of “legal” subunit definitions (a “subunit” is described later in this document)
+*   An optional list of instructions derived from this one.
+
+As in tablegen, an operand type must be either an operand name defined in the td description, a register class name defined in the td description, or simply a defined register name. If the operand type is a register name, the operand name is optional and ignored; such register operands are used to represent implied operands in LLVM instructions.
+
+Grammar:
+
+```
+    instruction_def  : 'instruction' IDENT
+                          '(' (operand_decl (',' operand_decl)*)? ')'
+                          '{'
+                              ('subunit' '(' name_list ')' ';' )?
+                              ('derived' '(' name_list ')' ';' )?
+                          '}' ';'? ;
+    operand_decl     : ((IDENT (IDENT)?) | '...') ('(I)' | '(O)')? ;
+```
+
+An example:
+
+
+```
+    instruction ADDSWri(GPR32 Rd(O), GPR32sp Rn(I), addsub_shifted_imm32 imm(I), NZCV(O)) {
+        subunit(sub24,sub26);
+    }
+```
+
+This describes an AArch64 add instruction that has two defined input operands (Rn, imm), one defined output operand (Rd), and one implicit output operand (NZCV). The instruction is associated with two subunits (sub24 and sub26).
+
+### **Defining Operands**
+
+Operand definitions are scraped from tablegen files (like instructions), and provide the following information to the MDL compiler for each operand:
+
+*   The operand’s name,
+*   Its sub-operands, with the operand type and operand name provided in the order they are declared.  Note that operand names are optional, and if not present we would refer to these by their sub-operand id (0, 1, etc),
+*   The operand’s value type.
+
+As in LLVM, an operand definition’s sub-operand types may in turn refer to other operand definitions. (Note that an operand’s sub-operands are declared with the same syntax as instruction operands.)
+
+Grammar:
+
+```
+    operand_def      : 'operand' IDENT
+                          '(' (operand_decl (',' operand_decl)*)? ')'
+                          '{' operand_type '}' ';'? ;
+```
+
+Some examples:
+
+```
+    operand GPR32z(GPR32 reg) { type(i32); } 
+    operand addsub_shifted_imm32(i32imm, i32imm) { type(i32); }
+```
+
+### **Defining Registers and Register Classes**
+
+Registers and register classes are scraped from tablegen output.  We provide a general method in the language to define registers and classes of registers which can reflect the registers defined in tablegen. 
+
+Grammar:
+
+```
+    register_def     : 'register' register_decl (',' register_decl)* ';' ;
+    register_decl    : IDENT ('[' range ']')? ;
+    register_class   : 'register_class' IDENT
+                            '{' register_decl (',' register_decl)* '}' ';'? 
+                     | 'register_class' IDENT '{' '}' ';'? ;
+```
+
+Examples:
+
+```
+    register a0, a1, a2, a3;                 // 4 registers
+    register a[4..7];                        // definition of a4, a5, a6, and a7
+
+    register_class low3 { a0, a1, a2 };      // a class of 3 registers
+    register_class high5 { a[3..7] };        // a class of a3, a4, a5, a6, and a7
+```
+
+The order of register definitions is generally insignificant in the current MDL - we use the register names defined in LLVM, and there are no cases in the MDL where we depend on order.  Register “ranges”, such as “a[0..20]”, are simply expanded into the discrete names of the entire range of registers.
+
+### **Defining Derived Operands**
+
+LLVM doesn’t necessarily provide all the information we want to capture about an instruction, so the MDL allows for defining “derived” operands with which we can associate named values.  A derived operand is essentially an alias for one or more LLVM-defined operands (or other derived operands), and provides a mechanism to add arbitrary attributes to operand definitions. Derived operands also allow us to treat a set of operand types as identical in latency reference rules (so you don’t have to specify a long set of operand types for some references).
+
+Grammar:
+
+```
+    derived_operand_def     : 'operand' IDENT (':' IDENT)+  ('(' ')')?
+                                  '{' (operand_type | operand_attribute)* '}' ';'? ;
+    operand_attribute_stmt  : 'attribute' IDENT '=' (snumber | tuple)
+                                ('if' ('lit' | 'address' | 'label')
+                                ('[' pred_value (',' pred_value)* ']' )? )? ';' ;
+
+    pred_value              : snumber
+                            | snumber '..' snumber
+                            | '{' number '}' ;
+    tuple                   : '[' snumber (',' snumber)* ']' ;
+```
+
+#### **Derivation**
+
+Each derived operand is declared with one or more “base” operands, for which it is an alias. Circular or ambiguous derivations are explicitly disallowed - there must be only one derivation path for a derived operand to any of its base concrete operands.
+
+Derived operands are used in place of their base operands in operand latency rules in latency templates (described later). This allows a rule to match a set of operands, rather than a single operand, and also can provide access to instruction attributes to the latency rule.
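+
+As a small, hedged sketch of what a derivation might look like (following the grammar above; the operand name imm_any and the attribute size are illustrative, while i32imm and i64imm are assumed to be LLVM-defined immediate operand types):
+
+```
+    // imm_any aliases two LLVM immediate operand types, so a single latency
+    // reference can match either one, and attaches a "size" attribute.
+    // The first predicate that evaluates true determines the attribute value.
+    operand imm_any : i32imm : i64imm {
+        attribute size = 1 if lit [0..255];   // literal fits in a byte
+        attribute size = 4;                   // default for everything else
+    }
+```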
+
+#### **Derived operand attributes**
+
+Derived operand attributes associate a named value (or tuple of values) with the operand type. Tuples are appropriate when an attribute is used as a set of masks for resource sharing, described later.
+
+Some examples:
+
+```
+    attribute my_attr_a = 1;
+    attribute my_attr_b = 123;
+    attribute my_tuple  = [1, 2, 3];
+```
+
+Attributes can have predicates that check if the operand contains a data address, a code address, or any constant.  Additionally, attributes can have multiple definitions with different predicates, with the first “true” predicate determining the final value of the attribute for that operand instance:
+
+```
+    attribute my_attr = 5 if address;    // if operand is a relocatable address
+    attribute my_attr = 2 if label;      // if operand is a code address
+    attribute my_attr = 3 if lit;        // if operand is any literal constant
+```
+
+
+Predicates for literal constants can also take an optional list of “predicate values”, where each predicate value is either an integer, a range of integers, or a “mask”. A mask predicate value specifies which bits of the constant are allowed to be non-zero, so it matches any constant whose set bits all fall within the mask:
+
+```
+    attribute my_attr = 5 if lit [1, 2, 4, 8];    // looking for specific values
+    attribute my_attr = 12 if lit [100..200];     // looking for a range of values
+    attribute my_attr = 1 if lit [{0x0000FFFF}];  // looking for a 16 bit number
+    attribute my_attr = 2 if lit [{0x00FFFF00}];  // also a 16-bit number!
+    attribute my_attr = 3 if lit [1, 4, 10..14, 0x3F800000, {0xFF00FF00}]; 
----------------
reidtatge wrote:

You can use masks ( {....} ) for the "even" example: {0xFFFFFFFE} :-)
Clearly, this isn't fully general. The goal was to support just about anything I had ever seen.  Note that I also didn't want to mess directly with floating point, but you can enumerate the binaries for that.  So I picked the syntax to reflect the bits that go in the instruction.

Of course we can always make it more comprehensive.

https://github.com/llvm/llvm-project/pull/78002

