[llvm] [LLVM][MDL] First integration of MDL with LLVM (PR #78002)

Reid Tatge via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 12 13:00:50 PST 2024


================
@@ -0,0 +1,2483 @@
+
+# MPACT Microarchitecture Description Language
+
+Reid Tatge          [tatge at google.com](mailto:tatge at google.com)
+
+
+## **Goals for a Machine Description Language**
+
+Modern processors are complex: multiple execution pipelines, dynamic dispatch, out-of-order execution, register renaming, forwarding networks, and (often) undocumented micro-operations. Instruction behaviors, including micro-operations, often can’t be accurately modeled _statically_, but only _statistically_. In these cases, the compiler’s model of a microarchitecture (Schedules and Itineraries in LLVM) is effectively closer to a heuristic than to a formal model. And this works quite well for general-purpose microprocessors.
+
+However, modern accelerators have different and/or additional dimensions of complexity: VLIW instruction issue, unprotected pipelines, tensor/vector ALUs, software-managed memory hierarchies. And it's more critical that compilers can precisely model the details of that complexity. Currently, LLVM’s Schedules and Itineraries aren’t adequate for directly modeling many accelerator architectural features.
+
+So we have several goals:
+
+1. We want a first-class, purpose-built, intuitive language that captures all the scheduling and latency details of the architecture - much like Schedules and Itineraries - and that works well not only for all current targets but also for a large class of accelerator architectures.
+2. The complexity of the specification should scale with the complexity of the hardware. 
+3. The description should be succinct, avoiding duplicated information, while reflecting the way things are defined in a hardware architecture specification.
+4. We want to generate artifacts that can be used in a machine-independent way for back-end optimization, register allocation, instruction scheduling, etc. - anything that depends on the behavior and constraints of instructions.
+5. We want to support a much larger class of architectures in one uniform manner.
+
+For this document (and language), the term “instructions” refers to the documented instruction set of the machine, as represented by LLVM instruction descriptions, rather than the undocumented micro-operations used by many modern microprocessors.
+
+The process of compiling a processor’s machine description creates several primary artifacts:
+
+*   For each target instruction (described in td files), we create an object that describes the detailed behaviors of the instruction in any legal context (for example, on any functional unit, on any processor).
+*   A set of methods with machine-independent APIs that leverage the information associated with instructions to inform and guide back-end optimization passes.
+
+The details of the artifacts are described later in this document.
+
+_Note: A full language grammar description is provided in an appendix.  Grammar snippets throughout the document provide only the pertinent section of the grammar; see Appendix A for the full grammar._
+
+The proposed language can be thought of as an _optional extension to the LLVM machine description_. For most upstream architectures, the new language offers little beyond a much more succinct way to specify the architecture than Schedules and Itineraries.  But for accelerator-class architectures, it provides a level of detail and capability not available in the existing tablegen approaches.
+
+### **Background**
+
+Processor families evolve over time. They accrete new instructions, and pipelines change - often in subtle ways - as they accumulate more functional units and registers; encoding rules change; issue rules change. Understanding, encoding, and using all of this information - over time, for many subtargets - can be daunting.  When the description language isn’t sufficient to model the architecture, the back-end modeling evolves towards heuristics, which leads to performance issues or bugs in the compiler. And it typically ends with large amounts of target-specific code to handle “special cases”.
+
+LLVM uses the [TableGen](https://llvm.org/docs/TableGen/index.html) language to describe a processor, and this is quite sufficient for handling most general-purpose architectures - there are 20+ processor families currently upstreamed in LLVM! In fact, it is very good at modeling instruction definitions, register classes, and calling conventions.  However, there are “features” of modern accelerator micro-architectures that are difficult or impossible to model in tablegen.
+
+We would like to easily handle:
+
+*   Complex pipeline behaviors
+    *   An instruction may have different latencies, resource usage, and/or register constraints on different functional units or with different operand values.
+    *   An instruction may read source registers more than once (in different pipeline phases).
+    *   Pipeline structure, depth, hazards, scoreboarding, and protection may differ between family members.
+*   Functional units
+    *   Managing functional unit behavior differences across subtargets of a family.
+    *   Imposing different register constraints on instructions (local register files, for example).
+    *   Sharing execution resources with other functional units (such as register ports).
+    *   Supporting functional unit clusters with separate execution pipelines.
+*   VLIW architectures
+    *   Issue rules can get extremely complex, and can depend on the encoding, operand features, and pipeline behavior of candidate instructions.
+
+More generally, we’d like specific language support to:
+
+*   Support all members of a processor family
+*   Describe CPU features, parameterized by subtarget
+    *   Functional units
+    *   Pipeline structure and behaviors
+
+Since our emphasis is on easily supporting accelerators and VLIW processors, in addition to supporting all existing targets, much of this is overkill for most upstreamed CPUs.  CPUs typically have much simpler descriptions, and don’t require much of the capability of our machine description language.  Incidentally, MDL descriptions of these targets (generated automatically from the tablegen Schedules and Itineraries) are typically much more concise than the original tablegen descriptions.
+
+### **Approach - “Subunits” and Instruction Behaviors**
+
+We developed a DSL that allows us to describe an arbitrary processor microarchitecture in terms that reflect what is typically documented in the hardware specification. The MDL compiler creates a database that provides microarchitecture behavior information that can _automatically_ inform critical back-end compiler passes, such as instruction scheduling and register allocation, in a machine-independent way. 
+
+It’s important to note the difference between an instruction definition, as described in LLVM, and an instruction instance.  Generally, instructions defined in LLVM share the same behaviors across all instances of that instruction in a single subtarget. Exceptions to this require non-trivial code in the back-end to model variant behavior.  In VLIW and accelerator architectures, each generated instance of an instruction can have different behaviors, depending on how it's issued, its operand values, the functional unit it runs on, and the subtarget. So we provide a reasonable way to model those differences.
+
+The MDL introduces the concept of a “subunit” to abstractly represent a class of instructions with the same behaviors. Subunit instances concretely connect instructions to descriptions of their behaviors, _and_ to the functional units that they can be issued on. A subunit is loosely analogous to a collection of SchedRead and SchedWrite resources.
+
+Naively, we could create a unique subunit for each behavior of each instruction, enumerating the cross-product of the instruction’s behaviors on every subtarget, functional unit, and issue slot. But subunits can be specialized by subtarget, functional unit, and instruction definition, so a single subunit definition can properly describe behaviors for sets of instructions in many different contexts.
+
+A key aspect of this language design is that we can explicitly represent the potentially polymorphic behavior of each generated instance of any instruction, on any functional unit, on any subtarget.  The representation also captures the fact that this information can vary between instances of the same instruction.
+
+We define a subunit as an object that defines the _behavior sets_ of an instruction instance in all legal contexts (functional units, issue slots), for each subtarget.  In particular, we want to know:
+
+*   What resources are shared or reserved, in what pipeline phases.
+    *   Encoding resources
+    *   Issue slot(s) used
+    *   Functional unit resources
+    *   Shared/private busses, register ports, resources, or pooled resources
+*   What registers are read and written, and in which pipeline phases (i.e., the instruction’s “latencies”) - see the sketch after this list.
+*   What additional register constraints a functional unit instance imposes on an instruction’s registers.
+
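+To make the register-latency item concrete, here is a small sketch written in the style of the latency rules shown later in this document. The template name, operand names, and phases below are invented purely for illustration; the def() and use() statements record the pipeline phase in which each operand is written or read.
+
+```
+    latency LMAC() {                       // hypothetical multiply-accumulate rule
+        use(E1, $src1);  use(E1, $src2);   // multiplier inputs read in phase E1
+        use(E3, $acc);                     // accumulator input read later, in E3
+        def(E4, $dst);                     // result written in phase E4
+    }
+```
+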
+The critical artifact generated by the MDL compiler is a set of instruction behaviors for each instruction definition.  For each subtarget, for each instruction, we generate a list of every possible behavior of that instruction on that CPU.  While this sounds daunting, in practice it's rare to have more than a few behaviors for an instruction, and most instruction definitions share their behaviors with many other instructions, across subtargets.
+
+## **Overview of a Processor Family Description**
+
+This document generally describes the language in a bottom-up order - details first.  But let’s start with a brief top-down overview of what a processor family description looks like, without going into details about each part.
+
+A minimal processor family description has the following components:
+
+*   A set of CPU definitions - one for each subtarget.
+*   A set of functional unit template definitions.
+*   A set of subunit template definitions.
+*   A set of latency template definitions.
+
+A CPU definition specifies a set of functional unit instances that define the processor, as well as pipeline descriptions, issue slot resources, and binding of functional units to issue slots.  Each functional unit instance can be parameterized and specialized.
+
+A functional unit template specifies a set of subunit instances implemented by an instance of the functional unit.  It can be parameterized and specialized for each instance in different CPUs.
+
+A subunit template abstractly defines a set of related operations that have similar behaviors. It specifies these behaviors with a set of “latency” instances.  It can also be parameterized and specialized for each instance in different functional unit templates.  Subunits tie instruction definitions both to the functional units on which they can execute and to the instruction behaviors described in latency templates.
+
+A latency template defines the pipeline behavior of a set of instructions.  It can be parameterized and specialized for each instance in a subunit instance.  It is also specialized for each instruction that is tied to it (through a subunit).  A latency rule, at a minimum, specifies when each operand is read and written in the execution pipeline.
+
+Here’s a simple example: a trivial CPU with three functional units, two issue slots, and a four-deep pipeline:
+
+```
+    cpu myCpu {
+        phases cpu { E1, E2, E3, E4 };
+        issue slot1, slot2;
+        func_unit FU_ALU my_alu1();     // an instance of FU_ALU
+        func_unit FU_ALU my_alu2();     // an instance of FU_ALU
+        func_unit FU_LOAD my_load();    // an instance of FU_LOAD
+    }
+
+    func_unit FU_ALU() {                // template definition for FU_ALU
+        subunit ALU();                  // an instance of subunit ALU
+    }
+    func_unit FU_LOAD() {               // template definition for FU_LOAD
+        subunit LOAD();                 // an instance of subunit LOAD
+    }
+
+    subunit ALU() {                     // template definition for ALU
+        latency LALU();                 // an instance of latency LALU
+    }
+    subunit LOAD() {                    // template definition for LOAD
+        latency LLOAD();                // an instance of latency LLOAD
+    }
+
+    latency LALU() {                    // template definition for LALU
+        def(E2, $dst);  use(E1, $src1);  use(E1, $src2);
+    }
+    latency LLOAD() {                   // template definition for LLOAD
+        def(E4, $dst);  use(E1, $addr);
+    }
+```
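+
+The same templates can serve multiple subtargets. As a purely illustrative sketch (the subtarget name and the extra ALU instance here are invented, not part of the example above), a second CPU definition could reuse FU_ALU and FU_LOAD unchanged:
+
+```
+    cpu myCpuWide {                     // hypothetical wider subtarget
+        phases cpu { E1, E2, E3, E4 };
+        issue slot1, slot2, slot3;      // one additional issue slot
+        func_unit FU_ALU my_alu1();     // same FU_ALU template as in myCpu
+        func_unit FU_ALU my_alu2();
+        func_unit FU_ALU my_alu3();     // an extra ALU in this subtarget
+        func_unit FU_LOAD my_load();
+    }
+```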
+
+A more complete description of each part of this description is provided in the section “Defining a Processor Family”.
+
+## **Defining an ISA**
+
+We need to map a microarchitecture model back to LLVM instruction, operand, and register definitions.  So, the MDL contains constructs for defining instructions, operands, registers, and register classes.  
+
+When writing a target machine description, it’s not necessary to write descriptions for instructions, operands, and registers - we scrape all of this information about the CPU ISA from the tablegen output as part of the build process, and produce an MDL file that contains these definitions. The machine description compiler uses these definitions to tie architectural information back to LLVM instructions, operands, and register classes.
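+
+As a rough sketch only - the operand notation and instruction syntax below are illustrative, not taken from the actual grammar (see Appendix A) - a scraped instruction definition carries the instruction’s name, its operands, and the subunits it may map to:
+
+```
+    // Hypothetical scraped instruction record (illustrative syntax only):
+    // a register-register add whose $dst operand is an output, whose
+    // $src1/$src2 operands are inputs, and which is legal on the ALU subunit.
+    instruction ADD_rr(GPR:$dst, GPR:$src1, GPR:$src2) { subunit(ALU); }
+```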
+
+We will describe these language features here, primarily for completeness.
+
+### **Defining Instructions**
+
+Instruction definitions are scraped from tablegen files, and provide the following information to the MDL compiler for each instruction:
+
+*   The instruction’s name (as defined in the td files)
+*   Its operands, with each operand’s type and name provided in declaration order, and an indication of whether each is an input or an output of the instruction.
+*   A set of “legal” subunit definitions (a “subunit” is described later in this document)
+*   An optional list of instructions derived from this one.
----------------
reidtatge wrote:

Actually, it was only used by our internal (TPU) processor as a hack to inherit information from a base instruction.  I don't believe any upstream targets use it.  They all have pseudo instructions, but don't have the base_instr field.

So really it's kind of obsolescent.

https://github.com/llvm/llvm-project/pull/78002


More information about the llvm-commits mailing list