[llvm] [LLVM][MDL] First integration of MDL with LLVM (PR #78002)

Peiming Liu via llvm-commits llvm-commits at lists.llvm.org
Tue Jan 23 13:38:29 PST 2024


================
@@ -0,0 +1,2825 @@
+
+# MPACT Microarchitecture Description Language
+
+Reid Tatge          [tatge at google.com](mailto:tatge at google.com)
+
+
+## **Goals for a Machine Description Language**
+
+Modern processors are complex: multiple execution pipelines, dynamically dispatched, out-of-order execution, register renaming, forwarding networks, and (often) undocumented micro-operations. Instruction behaviors, including micro-operations, often can’t be _statically_ modeled in an accurate way, but only _statistically_ modeled. In these cases, the compiler’s model of a microarchitecture (Schedules and Itineraries in LLVM) is effectively closer to a heuristic than a formal model. And this works quite well for general purpose microprocessors.
+
+However, modern accelerators have different and/or additional dimensions of complexity: VLIW instruction issue, unprotected pipelines, tensor/vector ALUs, software-managed memory hierarchies. And it's more critical that compilers can precisely model the details of that complexity. Currently, LLVM’s Schedules and Itineraries aren’t adequate for directly modeling many accelerator architectural features.
+
+So we have several goals:
+
+
+
+1. We want a first-class, purpose-built, intuitive language that captures all the scheduling and latency details of the architecture - much like Schedules and Itineraries - that works well for all current targets, but also for a large class of accelerator architectures.
+2. The complexity of the specification should scale with the complexity of the hardware. 
+3. The description should be succinct, avoiding duplicated information, while reflecting the way things are defined in a hardware architecture specification.
+4. We want to generate artifacts that can be used in a machine-independent way for back-end optimization, register allocation, instruction scheduling, etc - anything that depends on the behavior and constraints of instructions.
+5. We want to support a much larger class of architectures in one uniform manner.
+
+For this document (and language), the term “instructions” refers to the documented instruction set of the machine, as represented by LLVM instructions descriptions, rather than undocumented micro-operations used by many modern microprocessors. 
+
+ 
+
+The process of compiling a processor’s machine description creates several primary artifacts:
+
+
+
+*   For each target instruction (described in td files), we create an object that describes the detailed behaviors of the instruction in any legal context (for example, on any functional unit, on any processor).
+*   A set of methods with machine-independent APIs that leverage the information associated with instructions to inform and guide back-end optimization passes.
+
+The details of the artifacts are described later in this document.
+
+_Note: A full language grammar description is provided in an appendix.  Snippets of grammar throughout the document only provide the pertinent section of the grammar; see Appendix A for the full grammar._
+
+The proposed language can be thought of as an _optional extension to the LLVM machine description_. For most upstream architectures, the new language offers minimal benefit other than a much more succinct way to specify the architecture vs Schedules and Itineraries.  But for accelerator-class architectures, it provides a level of detail and capability not available in the existing tablegen approaches.
+
+
+### **Background**
+
+Processor families evolve over time. They accrete new instructions, and pipelines change - often in subtle ways - as they accumulate more functional units and registers; encoding rules change; issue rules change. Understanding, encoding, and using all of this information - over time, for many subtargets - can be daunting.  When the description language isn’t sufficient to model the architecture, the back-end modeling evolves towards heuristics, which leads to performance issues or bugs in the compiler. It also typically ends up with large amounts of target-specific code to handle “special cases”.
+
+LLVM uses the [TableGen](https://llvm.org/docs/TableGen/index.html) language to describe a processor, and this is quite sufficient for handling most general purpose architectures - there are 20+ processor families currently upstreamed in LLVM! In fact, it is very good at modeling instruction definitions, register classes, and calling conventions.  However, there are “features” of modern accelerator micro-architectures which are difficult or impossible to model in tablegen.
+
+We would like to easily handle:
+
+
+
+*   Complex pipeline behaviors
+    *   An instruction may have different latencies, resource usage, and/or register constraints on different functional units or different operand values.
+    *   An instruction may read source registers more than once (in different pipeline phases).
+    *   Pipeline structure, depth, hazards, scoreboarding, and protection may differ between family members.
+*   Functional units
+    *   Managing functional unit behavior differences across subtargets of a family.
+    *   Imposing different register constraints on instructions (local register files, for example).
+    *   Sharing execution resources with other functional units (such as register ports).
+    *   Supporting functional unit clusters with separate execution pipelines.
+*   VLIW Architecture 
+    *   Issue rules can get extremely complex, and can be dependent on encoding, operand features, and pipeline behavior of candidate instructions.
+
+More generally, we’d like specific language to:
+
+
+
+*   Support all members of a processor family
+*   Describe CPU features, parameterized by subtarget
+    *   Functional units
+    *   Issue slots
+    *   Pipeline structure and behaviors
+
+Since our emphasis is on easily supporting accelerators and VLIW processors, in addition to supporting all existing targets, much of this is overkill for most upstreamed CPUs.  CPUs typically have much simpler descriptions, and don’t require much of the capability of our machine description language.  Incidentally, MDL descriptions of these targets (generated automatically from the tablegen Schedules and Itineraries) are typically much more concise than the original tablegen descriptions.
+
+
+### **Approach - “Subunits” and Instruction Behaviors**
+
+We developed a DSL that allows us to describe an arbitrary processor microarchitecture in terms that reflect what is typically documented in the hardware specification. The MDL compiler creates a database that provides microarchitecture behavior information that can _automatically_ inform critical back-end compiler passes, such as instruction scheduling and register allocation, in a machine-independent way.
+
+It’s important to note the difference between an instruction definition, as described in LLVM, and an instruction instance.  Generally, instructions defined in LLVM share the same behaviors across all instances of that instruction in a single subtarget. Exceptions to this require non-trivial code in the back-end to model variant behavior.  In VLIW and accelerator architectures, each generated instance of an instruction can have different behaviors, depending on how it's issued, its operand values, the functional unit it runs on, and the subtarget. So we provide a way to model those differences in reasonable ways.
+
+The MDL introduces the concept of a “subunit” to abstractly represent a class of instructions with the same behaviors. Subunit instances concretely connect instructions to descriptions of their behaviors, _and_ to the functional units that they can be issued on. A subunit is vaguely analogous to collections of SchedRead and SchedWrite resources. 
+
+Naively, we could create unique subunits for each behavior for each instruction, the set of which would enumerate the cross-product of the instruction’s behaviors on every subtarget, functional unit, and issue slot. But subunits can be specialized by subtarget, functional unit, and each instruction definition, so a single subunit definition can properly describe behaviors for sets of instructions in many different contexts.
+
+A key aspect of this language design is that we can explicitly represent the potentially polymorphic behavior of each generated instance of any instruction, on any functional unit, on any subtarget.  The representation also comprehends that this information can vary between each of an instruction’s instances.
+
+  
+
+We define a subunit as an object that defines the _behavior sets_ of an instruction instance in all legal contexts (functional units, issue slots), for each subtarget.  In particular, we want to know:
+
+
+
+
+*   What resources are shared or reserved, in what pipeline phases.
+    *   Encoding resources
+    *   Issue slot(s) used
+    *   Functional unit resources
+    *   Shared/private busses, register ports, resources, or pooled resources
+*   What registers are read and written, in which pipeline phases (i.e., the instruction’s “latencies”).
+*   What additional register constraints a functional unit instance imposes on an instruction’s registers.
+
+The critical artifact generated by the MDL compiler is a set of instruction behaviors for each instruction definition.  For each subtarget, for each instruction, we generate a list of every possible behavior of that instruction on that CPU.  While this sounds daunting, in practice it's rare to have more than a few behaviors for an instruction, and most instruction definitions share their behaviors with many other instructions, across subtargets.
+
+
+## **Overview of a Processor Family Description**
+
+This document generally describes the language in a bottom-up order - details first.  But let's start with a brief top-down overview of what a processor family description looks like, without going into details about each part.
+
+A minimal processor family description has the following components:
+
+
+
+*   A set of CPU definitions - one for each subtarget.
+*   A set of functional unit template definitions.
+*   A set of subunit template definitions.
+*   A set of latency template definitions.
+
+A CPU definition specifies a set of functional unit instances that define the processor, as well as pipeline descriptions, issue slot resources, and binding of functional units to issue slots.  Each functional unit instance can be parameterized and specialized.
+
+A functional unit template specifies a set of subunit instances implemented by an instance of the functional unit.  It can be parameterized and specialized for each instance in different CPUs.
+
+A subunit template abstractly defines a set of related operations that have similar behaviors. They specify these behaviors with a set of “latency” instances.  They can also be parameterized and specialized for each instance in different functional unit templates.  Subunits tie instruction definitions both to the functional units on which they can execute, and to the instruction behaviors described in latency templates.
+
+A latency template defines the pipeline behavior of a set of instructions.  It can be parameterized and specialized for each instance in a subunit instance.  It is also specialized for each instruction that is tied to it (through a subunit).  A latency rule, at a minimum, specifies when each operand is read and written in the execution pipeline.
+
+Here’s a very simple example of a trivial CPU, with three functional units, two issue slots, and a four-deep pipeline:
+
+
+```
+    cpu myCpu {
+        phases cpu { E1, E2, E3, E4 };
+        issue slot1, slot2;
+        func_unit FU_ALU my_alu1();     // an instance of FU_ALU
+        func_unit FU_ALU my_alu2();     // an instance of FU_ALU
+        func_unit FU_LOAD my_load();    // an instance of FU_LOAD
+    }
+
+    func_unit FU_ALU() {                // template definition for FU_ALU
+        subunit ALU();                  // an instance of subunit ALU
+    }
+    func_unit FU_LOAD() {               // template definition for FU_LOAD
+        subunit LOAD();                 // an instance of subunit LOAD
+    }
+
+    subunit ALU() {                     // template definition for ALU
+        latency LALU();                 // an instance of latency LALU
+    }
+    subunit LOAD() {                    // template definition for LOAD
+        latency LLOAD();                // an instance of latency LLOAD
+    }
+
+    latency LALU() {                    // template definition for LALU
+        def(E2, $dst);  use(E1, $src1);  use(E1, $src2);
+    }
+    latency LLOAD() {                   // template definition for LLOAD
+        def(E4, $dst);  use(E1, $addr);
+    }
+```
+
+
+A more complete description of each part of this description is provided in the section “Defining a Processor Family”.
+
+## **Defining an ISA**
+
+We need to map a microarchitecture model back to LLVM instruction, operand, and register definitions.  So, the MDL contains constructs for defining instructions, operands, registers, and register classes.  
+
+When writing a target machine description, it’s not necessary to write descriptions for instructions, operands, and registers - we scrape all of this information about the CPU ISA from the tablegen output as part of the build process, and produce an MDL file which contains these definitions. The machine description compiler uses these definitions to tie architectural information back to LLVM instructions, operands, and register classes.
+
+We will describe these language features here, primarily for completeness.
+
+
+### **Defining Instructions**
+
+Instruction definitions are scraped from tablegen files, and provide the following information to the MDL compiler for each instruction:
+
+
+
+*   The instruction’s name (as defined in the td files)
+*   Its operands, with the operand type and name provided in the order they are declared, and indicating whether each is an input or output of the instruction.
+*   A set of “legal” subunit definitions (a “subunit” is described later in this document)
+*   An optional list of instructions derived from this one.
+
+As in tablegen, an operand type must be either an operand name defined in the td description, a register class name defined in the td description, or simply a defined register name. If the operand type is a register name, the operand name is optional (and ignored); these register operands are used to represent implied operands in LLVM instructions.
+
+Grammar:
+
+
+```
+    instruction_def  : 'instruction' IDENT
+                          '(' (operand_decl (',' operand_decl)*)? ')'
+                          '{'
+                              ('subunit' '(' name_list ')' ';' )?
+                              ('derived' '(' name_list ')' ';' )?
+                          '}' ';'? ;
+    operand_decl     : ((IDENT (IDENT)?) | '...') ('(I)' | '(O)')? ;
+```
+
+
+An example:
+
+
+```
+    instruction ADDSWri(GPR32 Rd(O), GPR32sp Rn(I), addsub_shifted_imm32 imm(I), NZCV(O)) {
+      subunit(sub24,sub26);
+    }
+```
+
+
+This describes an AArch64 add instruction that has two defined input operands (Rn, imm), one defined output operand (Rd), and one implicit output operand (NZCV), which is associated with two subunits (sub24, sub26).
+
+
+### **Defining Operands**
+
+Operand definitions are scraped from tablegen files (like instructions), and provide the following information to the MDL compiler for each operand:
+
+
+
+*   The operand’s name,
+*   Its sub-operands, with the operand type and operand name provided in the order they are declared.  Note that operand names are optional, and if not present we would refer to these by their sub-operand id (0, 1, etc),
+*   The operand’s value type.
+
+As in LLVM, an operand definition’s sub-operand types may in turn refer to other operand definitions. (Note that operand’s sub-operands are declared with the same syntax as instruction operands.)
+
+Grammar:
+
+
+```
+    operand_def      : 'operand' IDENT
+                          '(' (operand_decl (',' operand_decl)*)? ')'
+                          '{' operand_type '}' ';'? ;
+```
+
+
+Some examples:
+
+
+```
+    operand GPR32z(GPR32 reg) { type(i32); } 
+    operand addsub_shifted_imm32(i32imm, i32imm) { type(i32); }
+```
+
+
+
+### **Defining Registers and Register Classes**
+
+Registers and register classes are scraped from tablegen output.  We provide a general method in the language to define registers and classes of registers which can reflect the registers defined in tablegen. 
+
+Grammar:
+
+
+```
+    register_def     : 'register' register_decl (',' register_decl)* ';' ;
+    register_decl    : IDENT ('[' range ']')? ;
+    register_class   : 'register_class' IDENT
+                            '{' register_decl (',' register_decl)* '}' ';'? 
+                     | 'register_class' IDENT '{' '}' ';'? ;
+```
+
+
+Examples:
+
+
+```
+    register a0, a1, a2, a3;                 // 4 registers
+    register a[4..7];                        // definition of a4, a5, a6, and a7
+
+    register_class low3 { a0, a1, a2 };      // a class of 3 registers
+    register_class high5 { a[3..7] };        // a class of a3, a4, a5, a6, and a7
+```
+
+
+The order of register definitions is generally insignificant in the current MDL - we use the register names defined in LLVM, and there are no cases in the MDL where we depend on order.  Register “ranges”, such as “a[0..20]”, are simply expanded into the discrete names of the entire range of registers.
+
+
+### **Defining Derived Operands**
+
+LLVM doesn’t necessarily provide all the information we want to capture about an instruction, so the MDL allows for defining “derived” operands with which we can associate named values.  A derived operand is essentially an alias to one or more LLVM-defined operands (or derived operands), and provides a mechanism to add arbitrary attributes to operand definitions. Derived operands also allow us to treat a set of operand types as identical in latency reference rules (so you don’t have to specify a long set of operand types for some references.)
+
+Grammar:
+
+
+```
+    derived_operand_def     : 'operand' IDENT (':' IDENT)+  ('(' ')')?
+                                  '{' (operand_type | operand_attribute)* '}' ';'? ;
+    operand_attribute_stmt  : 'attribute' IDENT '=' (snumber | tuple)
+                                ('if' ('lit' | 'address' | 'label')
+                                  ('[' pred_value (',' pred_value)* ']' )? )? ';' ;
+    pred_value              : snumber
+                            | snumber '..' snumber
+                            | '{' number '}' ;
+    tuple                   : '[' snumber (',' snumber)* ']' ;
+```
+
+
+
+#### **Derivation**
+
+Each derived operand is declared with one or more “base” operands, for which it is an alias. Circular or ambiguous derivations are explicitly disallowed - there must be only one derivation path for a derived operand to any of its base concrete operands.
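+As an illustrative sketch (hedged - the derived operand name and attribute are hypothetical; we assume i32imm and i64imm are operand types scraped from the td files), a single derived operand can alias more than one base so that one latency reference can match either operand type:
+
+```
+    // A derived operand that aliases two scraped operand types; a latency
+    // reference to "any_imm" matches either base. The attribute is illustrative.
+    operand any_imm : i32imm : i64imm() {
+       attribute size = 1;
+    }
+```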
+
+Derived operands are used in place of their base operands in operand latency rules in latency templates (described later). This allows a rule to match a set of operands, rather than a single operand, and also can provide access to instruction attributes to the latency rule.
+
+
+#### **Derived operand attributes**
+
+Derived operand attributes associate name/value-tuple pairs with the operand type. Tuples are appropriate when an attribute is used as a set of masks for resource sharing, described later.  
+
+Some examples:
+
+
+```
+    attribute my_attr_a = 1;
+    attribute my_attr_b = 123;
+    attribute my_tuple  = [1, 2, 3];
+```
+
+
+Attributes can have predicates that check if the operand contains a data address, a code address, or any constant.  Additionally, attributes can have multiple definitions with different predicates, with the first “true” predicate determining the final value of the attribute for that operand instance:
+
+
+```
+    attribute my_attr = 5 if address;    // if operand is a relocatable address
+    attribute my_attr = 2 if label;      // if operand is a code address
+    attribute my_attr = 3 if lit;        // if operand is any literal constant
+```
+
+
+Predicates for literal constants can also take an optional list of “predicate values”, where each predicate value is either an integer, a range of integers, or a “mask”. A mask predicate value explicitly checks which bits of the constant can be non-zero:
+
+
+```
+    attribute my_attr = 5 if lit [1, 2, 4, 8];    // looking for specific values
+    attribute my_attr = 12 if lit [100..200];     // looking for a range of values
+    attribute my_attr = 1 if lit [{0x0000FFFF}];  // looking for a 16-bit number
+    attribute my_attr = 2 if lit [{0x00FFFF00}];  // also a 16-bit number!
+    attribute my_attr = 3 if lit [1, 4, 10..14, 0x3F800000, {0xFF00FF00}]; 
+```
+
+
+Note that we explicitly don’t directly support floating point numbers: this should be done instead with specific bit patterns or masks.  This avoids problems with floating point precision and format differences across systems:
+
+
+```
+    attribute my_attr = 1 if lit [0xBF800000, 0x402DF854];   // -1.0, or pi
+    attribute my_attr = 2 if lit [{0x7FFF000}];              // +BF16 number
+```
+
+
+If all of an attribute’s predicates are “false” for an instance of an operand, the compiler recursively checks the attribute’s value in each of the operand’s bases until it finds a true predicate (or an unpredicated attribute):
+
+
+```
+    operand i32imm() { type(i32); }   // scraped from llvm td file.
+
+    operand huge_imm : i32imm() {
+       attribute size = 3;
+    }
+    operand medium_imm : huge_imm() {
+       attribute size = 2 if lit [-32768..32767];
+    }
+    operand small_imm : medium_imm() {
+       attribute size = 1 if lit [0..16];
+    }
+```
+
+
+
+#### **Derived operand attribute usage**
+
+There is currently only a single context in which instruction attributes are used directly in the machine description, as part of resource references in latency rules (see “latency_resource_ref”). In this context, you can specify an attribute name which provides the number of resources needed for a resource allocation, and the mask used to determine shared operand bits associated with the resource.  An example:
+
+
+```
+    … my_resource:my_size_attribute:my_mask_attribute …
+```
+
+
+This resource reference uses the attributes from the operand associated with this reference to determine how many resources to allocate, and what bits in the operand to share.
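+As an illustrative sketch (hedged - the operand, attribute, and resource names here are hypothetical; only the reference form shown above comes from the MDL grammar), a derived operand might supply both attributes like this:
+
+```
+    // A derived immediate operand carrying allocation attributes.
+    operand pooled_imm : i32imm() {
+       attribute slots = 1 if lit [{0x0000FFFF}];  // constants that fit in 16 bits need 1 entry
+       attribute slots = 2;                        // everything else needs 2 entries
+       attribute share = [0x0000FFFF];             // only the low 16 bits are sharable
+    }
+
+    // A latency rule could then reference a pool as "imm_pool:slots:share" to
+    // allocate "slots" entries and share the bits selected by the "share" mask.
+```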
+
+
+## **Overview of Resources**
+
+Resources are used to abstractly describe hardware constructs that are used by an instruction in its execution.  They can represent:
+
+
+
+*   functional units, 
+*   issue slots, 
+*   register ports, 
+*   shared encoding bits, 
+*   or can name any hardware resource an instruction uses when it executes that could impact the instruction’s behavior (such as pipeline hazards).
+
+It’s important to note that different instances of an instruction can use completely different resources depending on which functional unit, and which subtarget, it’s issued on. The MDL has an explicit way to model this.
+
+The machine description provides a mechanism for defining and associating resources with the pipeline behaviors of instructions through the specialization of functional unit templates, subunit templates, and latency templates. It also allows automatic allocation of shared resources for an instruction instance from resource pools. The MDL compiler generates behavior descriptions which explicitly reference each resource (or resource pool) the instruction uses, and in what pipeline phases.  This provides a direct methodology for managing instruction issue and pipeline behaviors such as hazards.
+
+
+### **Defining Resources**
+
+There are a few ways that resources are defined:
+
+
+
+*   **Functional Units:** A resource is implicitly defined for every functional unit instance in a CPU definition. An instruction that executes on a particular instance will reserve that resource implicitly. 
+*   **Issue Slots:** Each CPU, or cluster of functional units in a CPU, can explicitly define a set of issue slots.  For a VLIW, these resources directly correspond to instruction encoding slots in the machine instruction word, and can be used to control which instruction slots can issue to which functional units.  For dynamically scheduled CPUs, these correspond to the width of the dynamic instruction issue.
+*   **Named Resources** can be explicitly defined in several contexts, described below.
+*   **Ports:** Ports are functional unit resources that model a register class constraint and a set of associated resources. These are intended to model register file ports that are shared between functional units.
+
+Explicitly defined resources have scope - they can be defined globally (and apply to all CPU variants), within a CPU, within a cluster, or within a functional unit template.  Intuitively, shared resources are typically defined at higher levels in the machine description hierarchy.  Resources and ports defined within a functional unit template are replicated for each instance of that functional unit.  “Issue” resources are defined in CPU and cluster instances.
+
+Named resource definitions have the following grammar:
+
+
+```
+    resource_def            : 'resource' ('(' IDENT ')')?
+                                  resource_decl (',' resource_decl)*  ';' ;
+    resource_decl           : IDENT (':' number)? ('[' number ']')?
+                            | IDENT (':' number)? '{' name_list '}'
+                            | IDENT (':' number)? '{' group_list '}' ;
+
+    port_def                : 'port' port_decl (',' port_decl)* ';' ;
+    port_decl               : IDENT ('<' IDENT '>')? ('(' resource_refs ')')? ;
+    issue_resource          : 'issue' ('(' IDENT ')')? name_list ';' ;
+```
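+As an illustrative sketch (hedged - the CPU, functional unit, port, resource, and register class names are hypothetical, and we assume the bracketed name in a port declaration is its register class constraint), resources, ports, and issue slots might be defined like this:
+
+```
+    cpu example_cpu {
+        issue slot0, slot1;              // issue slots, defined in a CPU (or cluster)
+        func_unit FU_ALU alu0();
+    }
+    func_unit FU_ALU() {
+        resource rf_read;                // replicated for each FU_ALU instance
+        port read_a<GPR32>(rf_read);     // a GPR32 register constraint tied to rf_read
+        subunit ALU();
+    }
+```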
+
+
+
+#### **Simple resource definitions**
+
+The simplest resource definition is simply a comma-separated list of names:
+
+
+```
+    resource name1, name2, name3;
+```
+
+
+A resource can also have an explicit pipeline stage associated with it, indicating that the defined resources are always used in the specified pipeline phase:
+
+
+```
+    resource(E4) name1, name2;    // define resources that are always used in E4
+```
+
+
+A resource can have a set of bits associated with it. This defines a resource that can be shared between two references if the bits in an associated operand reference are identical.
+
+
+```
+    resource immediate:8;         // define a resource with 8 bits of data
+```
+
+
+
+#### **Grouped resource definitions**
+
+We can declare a set of named, related resources:
+
+
+```
+    resource bits     { bits_1, bits_2, bits_3 };
+```
+
+
+A resource group typically represents a pool of resources that are shared between instructions executing in parallel, where an instruction may require one or all of the resources. This is a common attribute of VLIW architectures, and is used to model things like immediate pools and register ports.
+
+Any defined resource can be included in a group, and the order of the members of a group is significant when members are allocated.  If a group mentions a resource that is not defined in either the current or an enclosing scope, that member is implicitly declared as a resource in the current scope.  In the case above, if the members (bits_1, etc) are not declared, the compiler would create the definition:
+
+
+```
+    resource bits_1, bits_2, bits_3;
+```
+
+
+and the group members would refer to these definitions. (Note: we don’t support nested groups).
+
+The resource group can be referenced by name, referring to the entire pool, or by individual members, such as “bits.bits_2” to specify the use of a specific pooled resource.  Consider the following example:
+
+
+```
+    resource bits_1, bits_2, bits_3;
+    resource bits_x { bits_1, bits_2, bits_3 };
+    resource bits_y { bits_3, bits_1, bits_2 };
+```
+
+
+“bits_x” and “bits_y” are distinct groups that reference the same members, but the members are allocated in a different order.  Groups can also be defined with syntax that indicates how their members are allocated by default.
+
+
+```
+    resource bits_or  { bits_1 | bits_2 | bits_3 };       // allocate one of these
+    resource bits_and { bits_1 & bits_2 & bits_3 };       // allocate all of these
+```
+
+
+Groups can also be implicitly defined in functional unit and subunit template instantiations as a resource parameter.
+
+
+```
+    func_unit func my_fu(bits_1 | bits_2 | bits_3);
+```
+
+
+This implicitly defines a resource group with three members, and passes that group as a parameter of the instance.
+
+
+#### **Pooled resource definitions**
+
+We can also declare a set of “unnamed” pooled resources:
+
+
+```
+    resource shared_bits[0..5];
+```
+
+
+This describes a resource pool with 6 members.  The entire pool can be referenced by name (i.e., “shared_bits”), or each member can be referenced by index (“shared_bits[3]”), or a subrange of members (“shared_bits[2..3]”). A resource reference can also indicate that it needs some number of resources allocated with the syntax: shared_bits:<number>.
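+For example, using the pool defined above, the reference forms just described look like:
+
+```
+    shared_bits             // reference the entire pool
+    shared_bits[3]          // reference one member by index
+    shared_bits[2..3]       // reference a subrange of members
+    shared_bits:2           // ask for 2 members to be allocated from the pool
+```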
+
+Resource pools can also have data associated with them; each member has its own set of bits:
+
+
+```
+    resource bits:20 { bits_1, bits_2, bits_3 };
+    resource shared_bits:5[6];
+```
+
+
+Resource pools, like resource groups, are used to model things like shared encoding bits and shared register ports, where instructions need one or more members of a set of pooled resources.
+
+Finally, resource definitions can pin a resource to a particular pipeline phase. All references to that resource will be automatically modeled only at that pipeline stage. This is particularly useful for modeling shared encoding bits (typically for resource pools).  The syntax for that looks like:
+
+
+```
+    resource(E1) my_pool { res1, res2, res3 };
+```
+
+
+where E1 is the name of a pipeline phase.  The resource “my_pool” (and each of its elements) is always modeled to be reserved in pipeline phase E1.
+
+
+### **Using Resources**
+
+Resource references appear in several contexts.  They are used in all template instantiations to specialize architecture templates (functional unit, subunit, or latency templates) and are ultimately used in latency rules to describe pipeline behaviors. These will be described later in the document.
+
+When used to specialize template instances, resource references have the following grammar:
+
+
+```
+    resource_ref            : IDENT ('[' range ']')?
+                            | IDENT '.' IDENT
+                            | IDENT '[' number ']'
+                            | IDENT ('|' IDENT)+
+                            | IDENT ('&' IDENT)+ ;
+```
+
+
+Some examples of resource uses in functional unit instantiations, subunit instantiations, latency instantiations, and latency reference rules:
+
+
+```
+some_resource           // reference a single resource or an entire group/pool    
+some_resource_pool[1]   // use a specific member from an unnamed pool.
+register_ports[6..9]    // select a subset of unnamed pooled resources.
+group.xyzzy             // select a single named item from a group.
+res1 | res2 | res3      // select one of these resources
+res6 & res7 & res8      // select all of these resources
+```
+
+
+References in latency reference rules have additional syntax to support the allocation of resources from groups and pools:
+
+
+```
+    latency_resource_ref    : resource_ref ':' number (':' IDENT)?
+                            | resource_ref ':' IDENT (':' IDENT)?
+                            | resource_ref ':' ':' IDENT
+                            | resource_ref ':' '*'
+                            | resource_ref ;
+```
+
+
+
+#### **Allocating Grouped and Pooled Resources**
+
+Latency references allow you to optionally manage allocation of pooled resources, as well as to specify the significant bits of operands whose values can be shared with other instructions.
+
+A reference of the form:
+
+
+```
+	some_resource_pool:1
+```
+
+
+indicates that a reference needs one element from a group/pooled resource associated with a latency reference. A reference of the form:
+
+
+```
+	some_resource_pool:2
+```
+
+
+indicates that the reference needs 2 (or more) _adjacent_ elements from a pooled resource associated with a latency reference.  A reference of the form:
+
+
+```
+	some_resource_pool:*
+```
+
+
+indicates that a reference needs _all_ elements from a resource group or pool. Note that grouped resources can only use :1 and :*.
+
+A reference of the form:
+
+
+```
+some_resource_pool:size
+```
+
+
+indicates an operand reference that requires some number of resources from the resource pool.   The number of resources needed is specified in the “size” attribute of the associated operand type. This enables us to decide at compile time how many resources to allocate for an instruction’s operand based on its actual value.  For example, large operand constant values may require more resources than small constants, while some operand values may not require any resources. There’s a specific syntax for describing these attributes in derived operand definitions (described earlier).
+
+In the examples above, if the resource has shared bits associated with it (it’s shareable by more than one instruction), the entire contents of the operand are shared. In some cases, only part of the operand’s representation is shared, and we can specify that with the following reference form:
+
+```
+	some_resource_pool:size:mask
+```
+
+This indicates that the associated operand’s “mask” attribute specifies which of the operand bits are sharable.  Finally, we can use a share-bits mask without allocation:
+
+```
+	some_resource_pool::mask
+```
+
+This reference utilizes the resource - or an entire pool - and uses the operand’s “mask” attribute to determine which bits are shared with other references.
+
+We will describe how these references work when we describe latency rules.
+
+
+## **Defining a Processor Family**
+
+A TableGen description describes a family of processors, or subtargets, that share instruction and register definitions. Information about instruction behaviors is described with Schedules and Itineraries. The MDL also uses common instruction and register descriptions, scraped from TableGen, and adds first-class descriptions of CPUs, functional units, and pipeline modeling.
+
+In an MDL CPU description, a CPU is described as an explicit set of functional units.  Each functional unit is tied to a set of subunits, and subunits are in turn explicitly tied to instruction definitions and pipeline behaviors.  There are two approaches for associating subunits with functional units, and the choice of which one to use is dependent on the attributes of the architecture you’re describing:
+
+
+
+1. Subunit templates specify (either directly or through Latencies) which functional units they use, or
+2. You define functional unit templates that specify exactly which subunits they use.
+
+More detail on this below.
+
+
+### **Method 1: SuperScalar and Out-Of-Order CPUs**
+
+Fully protected pipelines, forwarding, out-of-order issue and retirement, imprecise micro-operation modeling, and dynamic functional unit allocation make this class of CPUs difficult to model _precisely_.  However, because of their dynamic nature, precise modeling is both impossible and unnecessary.  But it is still important to provide descriptions that enable scheduling heuristics to understand the relative temporal behavior of instructions.
+
+This method is similar to the way Tablegen “Schedules” associate instructions with a set of ReadWrite resources, which are in turn associated with sets of ProcResources (or functional units), latencies and micro-operations. This approach works well for superscalar and out-of-order CPUs, and can also be used to describe scalar processors.
+
+The upside of this method is that you don’t need to explicitly declare functional unit templates.  You simply declare CPU instances of the functional units you want, and the MDL compiler creates implicit definitions for them.
+
+The downside of this method is that you can’t specialize functional unit instances, which in turn means you can’t specialize subunit instances, or associated latency instances.  Fortunately, specialization generally isn’t necessary for this class of CPUs.  It would also be difficult to use this method to describe a typical VLIW processor (which is why we have method 2!).
+
+We generally describe this as a “bottom-up” approach (subunits explicitly tie themselves to functional unit instances); it is the approach used by the Tablegen scraper (tdscan) for “Schedule-based” CPUs.
+
+
+### **Method 2: VLIWs, and everything else**
+
+This method is appropriate for machines where we must provide more information about the detailed behavior of an instruction so that we can correctly model its issuing and pipeline behavior. It is particularly important for machines with deep, complex pipelines that _must_ be modeled by the compiler.  It has a powerful, flexible user-defined resource scheme which provides a lot more expressiveness than either “Schedules” or “Itineraries”. 
+
+In this method, a functional unit instance is an instantiation of an _issuing_ functional unit, which is more typical of scalar and VLIW CPUs.  In the common case where different instances of a functional unit have different behaviors, we can easily model that using functional unit, subunit, and latency instance specialization, and more detailed latency rules. 
+
+This approach allows a very high degree of precision and flexibility that's not available with method 1.  It’s strictly more expressive than the first method, but much of that expressiveness isn’t required by superscalar CPUs.
+
+We describe this as a “top-down” approach (explicit functional unit template definitions assert which subunits they support).  This is the method tdscan uses when scraping information about itineraries.
+
+
+### **Schema of a Full Processor Family Description**
+
+By convention, a description generally describes things in the following order (although the order of these definitions doesn’t matter):
+
+
+
+*   Definition of the family name.
+*   Describe the pipeline model(s).
+*   Describe each CPU (subtarget) in terms of functional unit instances.
+*   Describe each functional unit template in terms of subunit instances (top-down approach).
+*   Describe each subunit template type. A subunit represents a class of instruction definitions with similar execution behaviors, and ties those instructions to a latency description.
+*   Describe each latency in terms of operand and resource references.
+
+We will describe each of these items in more detail.  A machine description for a target has the following general schema: (a full syntax is provided in Appendix A)
+
+
+```
+    <family name definition>
+    <pipeline phase descriptions>
+    <global resource definitions>
+    <derived operand definitions>
+
+    // Define CPUs
+    cpu gen_1 { 
+       <cpu-specific resource definitions>
+       <functional unit instance>
+       <functional unit instance>
+       …
+    } 
+    cpu gen_2 { … }
+    …
+
+    // Define Functional Unit Template Definitions (Top-down approach)
+    func_unit a_1(<functional unit parameters>) { 
+       <functional-unit-specific resource and port definitions>
+       <subunit instance>
+       <subunit instance>
+       …
+    }
+    func_unit b_1(…) { … } 
+    …
+
+    // Define Subunit Template Definitions
+    subunit add(<subunit parameters>) {
+       <latency instance>
+       <latency instance>
+       …
+    }
+    subunit mul(…) { … }
+    …
+
+    // Latency Template Definitions
+    latency add(<latency parameters>) {
+       <latency reference>
+       <latency reference>
+       …
+    }
+    latency mul(…) { … }
+    …
+
+    // Instruction information scraped from Tablegen description
+    <register descriptions>
+    <register class descriptions>
+    <operand descriptions>
+    <instruction descriptions> 
+```
+
+
+
+#### **Bottom-up vs Top-down CPU Definition Schemas**
+
+In the “top-down” schema, we define CPUs, which instantiate functional units, which instantiate subunits, which instantiate latencies.  At each level of instantiation, the object (functional unit, subunit, latency) can be specialized for the context that it’s instantiated in.  We think of this as a “top-down” definition of a processor family. We provide detailed descriptions for each functional unit template, which we can specialize for each instance.
+
+However, for many processors, this specialization is unnecessary, and the normal schema is overly verbose. For these kinds of processors, we can use the “bottom-up” schema.
+
+In this schema, the MDL compiler _implicitly_ creates functional unit and latency templates:
+
+
+
+*   A CPU definition specifies, in the normal syntax, which functional units are used.
+*   Subunits directly implement latency rules inline (rather than instantiating a latency template), including an explicit functional unit instance that they can execute on.
+
+Here’s an example of this kind of bottom-up description:
+
+
+```
+    cpu dual_cpu {
+        func_unit ALU alu1();     // an "ALU" functional unit instance, named "alu1"
+        func_unit ALU alu2();     // an "ALU" functional unit instance, named "alu2"
+    }
+    subunit alu2() {{ def(E2, $dst); use(E1, $src); fus(ALU, 3); }}
+    subunit alu4() {{ def(E4, $dst); use(E1, $src); fus(ALU, 7); }}
+    subunit alu7() {{ def(E7, $dst); use(E1, $src); fus(ALU, 42); }}
+```
+
+Note that we don’t explicitly define the ALU functional unit template, but it is instantiated (twice) and used in three subunit/latency templates. Similarly, we don’t explicitly define the three latency templates.  Both the functional unit template and the latency templates are implicitly created in the MDL compiler. 
+
+While this schema is much more compact, neither the functional units nor the subunits/latencies can be specialized. This is an appropriate approach for scalar and superscalar processors, and is used by tdscan for CPUs that use Tablegen Schedules.
+
+
+### **Specifying the Family Name**
+
+A family name must be specified that ties the description to the LLVM name for the processor family.  It has the following grammar:
+
+
+```
+family_name        : 'family' IDENT ';' ;
+```
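+For example (a hedged sketch - the family name here is hypothetical; it should match the LLVM name of the target):
+
+```
+    family MyDSP;      // ties this description to the LLVM "MyDSP" processor family
+```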
+
+
+
+### **Pipeline Definitions**
+
+We don’t explicitly define instruction “latencies” in the MDL. Instead, we specify when instructions’ reads and writes happen in terms of pipeline phases.  From this, we can calculate actual latencies. Rather than specify pipeline phases with numbers, we provide a way of naming pipeline stages, and refer to those stages strictly by name. A pipeline description has the following grammar:
+
+
+```
+    pipe_def           : protection? 'phases' IDENT '{' pipe_phases '}' ';'? ;
+    protection         : 'protected' | 'unprotected' | 'hard' ;
+    pipe_phases        : phase_id (',' phase_id)* ;
+    phase_id           : '#'? IDENT ('[' range ']')? ('=' number)? ;
+```
+
+
+ For example:
+
+
+```
+phases my_pipeline { fetch, decode, read1, read2, ex1, ex2, write1, write2 };
+```
+
+
+We typically define these in a global phase namespace, and they are shared between CPU definitions. All globally defined phase names must be unique. However, each CPU definition can have private pipeline definitions, and names defined locally override globally defined names.
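+As an illustrative sketch (hedged - the pipeline, phase, and CPU names are hypothetical), a CPU-local pipeline can be defined alongside a global one, with the locally defined names taking precedence inside that CPU:
+
+```
+    phases common { F, D, E1, E2 };              // global pipeline, shared by all CPUs
+
+    cpu my_vliw {
+        phases deep { F, D, E1, E2, E3, E4 };    // CPU-local pipeline; these names
+                                                 // override the global ones here
+        …
+    }
+```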
+
+You can define more than one pipeline, and each pipeline can have the attribute “protected”, “unprotected”, or “hard”.  “Protected” is the default if none is specified.
+
+
+```
+    protected phases alu { fetch, decode, ex1, ex2 };
+    unprotected phases vector { vfetch, vdecode, vex1, vex2 };
+    hard phases branch { bfetch, bdecode, branch };
+```
+
+
+A “protected” pipeline describes a machine where the hardware manages latencies between register writes and reads by injecting stalls into a pipeline when reads are issued earlier than their inputs are available, or resources are oversubscribed (pipeline hazards). Most modern general purpose CPUs have protected pipelines, and in the MDL language this is the default behavior.
+
+An “unprotected” pipeline never inserts stalls for read-after-writes or pipeline hazards. In this type of pipeline, reads fetch whatever value is in the register (in the appropriate pipeline phase).  A resource conflict (hazard) results in undefined behavior (i.e., the compiler must avoid hazards!). In this model, if an instruction stalls for some reason, the entire pipeline stalls. This kind of pipeline is used in several DSP architectures.
+
+A “hard” pipeline typically describes the behavior of branch and call instructions, whose side effects occur at a particular pipeline phase.  The occurrence of the branch or call always happens at that pipeline phase, and the compiler must accommodate that (by inserting code in the “delay slots” of the branch/call).
+
+You can define multiple stages as a group - the following rule defines the same stages as the first example above.
+
+
+```
+    phases alu { fetch, decode, read[1..2], ex[1..2], write[1..2] };
+```
----------------
PeimingLiu wrote:

Are those xml tags intended?

https://github.com/llvm/llvm-project/pull/78002


More information about the llvm-commits mailing list