[LLVMdev] [global-isel] Proposal for a global instruction selector

Jakob Stoklund Olesen stoklund at 2pi.dk
Thu Aug 8 16:18:01 PDT 2013


I am hoping that this proposal will generate a lot of feedback, and there are many different topics to discuss. When replying to this email, please change the subject header to something more specific, but keep the [global-isel] tag.

Thanks,
/jakob

Proposal for a Global Instruction Selector

It is becoming evident that we need a replacement for the SelectionDAG-based instruction selector. This proposal describes the architecture of a global instruction selector based on the MachineInstr intermediate representation.

A new instruction selector is a large project that will probably take a couple of years to complete. This proposal only describes the desired final design, it does not describe the path of incremental development we should follow to get there.

This is a strawman design which shouldn't be seen as a final proposal. I am hoping for a lot of community feedback, and I don't expect we'll have a final design before we actually begin implementing it.

Why not legalize LLVM IR?

It's been proposed to legalize LLVM IR before instruction selection, see http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-April/061567.html. I recommend reading the mailing list thread debating the proposal, but let me summarize the reasons for legalizing MI instead of LLVM IR:

ABI boundaries have a too high-level representation in LLVM IR. By ABI boundaries, I mean function call and return, landing pads, and inline assembly.

LLVM IR doesn't make it clear which function arguments go in registers and which go on the stack. Stack frame and call frame maintenance instructions are not represented in LLVM IR. Byval arguments are a big mystery, and some arguments are actually passed in multiple registers, depending on their type.

While LLVM IR could be extended to represent all these things, it would pollute the clean design of LLVM IR, and it is not the right level of abstraction for this IR.

The getelementpointer instruction is too abstract. We want pointer arithmetic to be exposed as normal integer arithmetic in a type system without pointers. This would require a lot of awkward pointer-to-integer conversions in LLVM IR.

The LLVM IR type system is too high-level. We would need to get rid of first-class aggregates and pointers. This is possible, but it would require lots of bitcasts and other noisy instructions that are really no-ops as far as instruction selection is concerned.

We want LLVM IR to be readable by future versions of the compiler, and we desire that target authors can rearrange their design as the target evolves. Requiring target authors to autoupgrade old idioms (e.g., the equivalent of X86ISD nodes) isn't a reasonable thing.

Motivation

Everybody loves to hate SelectionDAG, but it is still useful to make its shortcomings explicit. These are some of the goals for a new instruction selector architecture:

We want a global instruction selector.

SelectionDAG operates on a basic block at a time, and we have been forced to implement a number of hacks to work around that. For example, most of CodeGenPrepare is moving instructions around to make good local instruction selection more likely. Legalization of switches and selects must be done either before or after instruction selection because it requires creating new basic blocks.

A number of passes running after instruction selection are also mostly about cleaning up after the single-basic-block selector. This includes MachineCSE, MachineLICM, and the peephole pass.

We want a faster instruction selector.

The SelectionDAG process is quite heavyweight because it uses continuous CSE, a whole new IR, and a mandatory scheduling phase to linearize the DAG. By selecting directly to MI, we can avoid one IR translation phase. By using a linearized IR, scheduling becomes optional.

We want a shared code path for fast and good instruction selection.

Currently, the fast instruction selector used for -O0 builds is a completely separate code path. This is not healthy because it increases the likelihood of bugs in the fast path that were not present in the slow path.

It would be better if the -O0 instruction selector were a trimmed down version of the full instruction selector.

We want an IR that represents ISA concepts better.

The SelectionDAG IR is very good at representing LLVM IR directly, but as the code is lowered to model target machine concepts, weird hacks are often required. This is evident in the way too many SDNodes required to represent a function call, or the many custom ISD nodes that targets need to define.

In many cases, custom target code knows exactly which instructions it wants to produce, and the IR should make it possible and easy to just emit the desired instructions directly. The MI intermediate representation isn't perfect either, and we should plan some MI improvements as well.

The SelectionDAG concept of legal types and their mapping to a single register class often causes problems. In some cases, it is necessary to lie about value types, just to get the instruction selector to do the right thing.

We want a more configurable instruction selector.

Weird targets have weird requirements, and it should be possible for targets to inject new passes into the instruction selection process. Sometimes, it may even be required to replace a standard pass.

Overall design

The global instruction selector is implemented as a series of machine function passes. Targets can inject their own passes or replace standard passes, using the existing mechanism for configuring code generation pipelines.

The MI intermediate representation is extended with a set of abstract target-independent opcodes very similar to the ISD opcodes used by SelectionDAG. Virtual registers can be typed so they have an EVT instead of a register class.

The standard passes are:

MI builder.
Switch lowering.
Iterative legalization and cleanup.
Register bank selection.
Common subexpression elimination.
Instruction selection.
I'll describe these passes in more detail below.

MI builder

The MI builder pass translates LLVM IR to MI IR, much like the current SelectionDAGBuilder phase. Like SelectionDAGBuilder, this pass will:

Expand LLVM IR first-class aggregate types into their constituent parts. This also includes splitting load, store, phi, and select instructions.
Replace pointer types with target-specific integers.
Expand getelementptr instructions with integer adds and multiplies.
Map LLVM instructions to target-independent abstract MI opcodes.
Lower ABI boundaries with help from target hooks.
Unlike SelectionDAGBuilder, this pass will:

Create a 1-1 mapping of the LLVM IR CFG. No new basic blocks are created.
Preserve switch instructions.
Preserve i1 logic instructions.
Not use MERGE_VALUES instructions. We'll use a value map that can handle aggregates.
The aggregate type expansion creates value types that can all be represented by EVTs, and MachineRegisterInfo will be extended to allow virtual registers to have an EVT instead of a register class. EVTs are all the integer, floating point, and vector types from LLVM IR.

The ABI boundary lowering requires types to be broken down further into 'legal types' that can be mapped to registers. The secondary breakdown is currently handled by TargetLowering::LowerCallTo() calling getRegisterType() and getNumRegisters(). Most ABIs are defined in terms of C types, not LLVM IR types, so there is a close connection between the C frontend and the ABI lowering code in the instruction selector. It would be a good idea to have the ABI lowering code work independently of the type system used during instruction selection. This is made possible by having ABI lowering be part of the LLVM IR to MI translation process.

Switch lowering

SelectionDAGBuilder is lowering switches immediately because the CFG can't be changed during instruction selection. This isn't required with a global instruction selector, so switch lowering can be its own pass, simplifying the design of the MI builder.

The switch lowering pass converts switch instructions into a combination of branch trees and jump tables. A default target-independent implementation is possible, but targets may want to override this pass. For example, the ARM target could try to create jump tables that would work well with the TBB/TBH instructions to reduce code size.

It is convenient to lower switches in the MI intermediate representation because it already has an object type for jump tables.

Legalization

SelectionDAG's concept of legal types isn't clearly defined. It seems to be a bit random which operations must be supported before a type can be considered legal. We'll define legal types precisely:

A type is considered legal if it can be loaded into and stored from an allocatable register class. (And all allocatable register classes must support copy instructions.)

We don't require other supported operations than load and store for a type to be legal, and all types that can be loaded and stored are legal. This means that most targets will have a much larger set of legal types than they do today. Also note that we don't require targets to designate a register class to use for each legal type; in fact, TableGen can compute the legal type set automatically. Register classes can be inferred from the selected instructions,MachineRegisterInfo::recomputeRegClass() already knows how to do that.

With a much larger set of legal types, a separate type legalization phase becomes superfluous. The operation legalizer must be able to do anything the type legalizer can do anyway, so the type legalizer isn't adding any value.

The legalizer will work bottom-up and iteratively. As illegal instructions are visited, the legalizer will apply transformations that make the current instruction 'more legal'. Each transformation is allowed to modify a whole chain of single-use instructions for efficiency, but it is also allowed to only modify the current instruction in hard cases. The instructions created don't need to be legal, the legalizer iterates until the current instruction is legal before it moves on.

The set of new instructions created while legalizing a single instruction is fed through an instruction simplifier that cleans up redundancies. This replaces DAGCombine.

Clearly, more details are required here, and community input would be much appreciated. Some specific questions that need answering are:

What does 'more legal' mean? Can we restrict the possible legalization transformations so the iterative process is guaranteed to make progress?

A long time ago, SelectionDAG had a single legalization pass. It was broken into multiple passes because the code got too complicated. Is the iterative approach proposed here good enough to keep the code complexity under control?

In my opinion, it's the value mapping that gets you. If the legalization can be broken into small well-defined transformations with no shared state, such as a value map, then there is no reason for the code to become complicated.

Compare this to the live range splitting implemented in the register allocator. Splitting a live range can be fairly complicated, but it is implemented as a transaction that transforms valid IR to valid IR. We no longer have the side maps with deferred spill code and so on; there is no shared state between live range splits other than the IR. This makes the problem manageable.

DAGCombine is supposed to clean up legalization products, but today it seems to be accumulating some transformations that probably belong in InstCombine. Since it is running so late in the pass pipeline, its canonicalizing approach is causing problems for targets that are trying to arrange the IR optimally. Optimal and canonical are not always the same thing.

Can we get away with a gentler instruction simplifier that is more focused at cleaning up after legalization?

Does it still work if we constrain the instruction simplifier to creating legal operations?

Register bank selection

Many instruction set architectures have multiple register banks. X86 has three: Integer, vector, and x87 floating point registers. (Four if you count MMX registers as a separate bank.) Blackfin and m68k have separate pointer and data register banks, etc. It is also common to have the same operations available in multiple register banks. For example, most ISAs with a vector register bank support bitwise and/or/xor operations in both the integer and vector register banks.

SelectionDAG is using the type system to select register banks; each legal type is mapped to a single register class. This mapping is causing a lot of strange modeling problems in targets, and it often leads to spurious cross-class copies with values bouncing back and forth between register banks. See for example the strange occurrences of v1i64 in the ARM target. It is also very difficult to make efficient use of operations that are available in multiple register banks.

The global instruction selector will assign virtual registers to register banks explicitly, not by using the type system. The set of register banks is typically small (2-3) and defined by the target. Modelling register banks explicitly makes it possible to move operations between register banks to minimize cross-bank copies which are often expensive. SPARC even requires cross-bank copies to go through memory, as does x86 in some cases.

The register bank selection pass computes the optimal bank assignment and inserts copy instructions when a value needs to cross banks. Sometimes, it may be beneficial to have the same value available in two register banks simultaneously, this can also be represented with cross-bank copy instructions. The bank selection can also be affected by register pressure concerns. On x86-64, for example, many i32 values could be moved to the SSE registers, freeing up the integer registers.

The interaction between legalization and bank selection is not clear yet. For example, i64 would be a legal type on ARM with and/or/xor/add/sub operations supported in a D-register. Some i64 operations are not supported in D-registers, and it would be necessary to expand those i64 operations to i32 operations in the GPR bank. When that happens, it is most likely preferable to 'legalize' more connected i64 operations so they can be moved to the GPR bank as well.

Instruction selection

The instruction selection pass replaces most target-independent instructions with real target opcodes, and it ensures that all virtual registers have register classes instead of EVTs. Some target-independent instructions, like COPY, are not lowered until after register allocation.

SelectionDAG instruction selection is controlled only by expression types, and the selected instructions are expected to use the unique register banks selected by the type:

(operation, type) --> opcode
We're representing register banks explicitly, and many operations are available in multiple banks, so the register bank needs to be part of the instruction selection:

(operation, type, bank) --> opcode
On the other hand, when types are not used to select register banks, it becomes really difficult to explain the difference between load i32 and load f32. The hardware doesn't care either, it simply knows how to load 32 bits into a given register. We can use a three-level hierarchical type system to better describe this:

Bit width. Operations like load, store, select, and the bitwise and/or/xor only depend on the number of bits in the operands. Their effect is independent of the vector lane structure and whether lanes are ints or floats.

Vector lanes + lane width. Operations like shufflevector and insertelement depend of the vector topology of the operand types, but they don't care if vector lanes are floats or ints.

Full EVTs. Finally, arithmetic instructions actually depend on the full type of their operands. It is worth noting, though, that the int/float distinction is already redundantly encoded in the operations. LLVM IR has separate add and fadd instructions, for example.

The instruction selection will only use the relevant parts of the operand type, depending on the operation being matched. It will consider load i32, load f32, and load v2i16 to be simply 32-bit wide loads. The target is likely to have multiple 32-bit load instructions. The explicit register bank is used to pick the right one.

Apart from the way operations and types are matched, the instruction selection algorithm is a lot like the current SelectionDAG algorithm. The process is less transactional than it is in SelectionDAG. Specifically, targets are allowed to insert pre-selected instructions at any time.

The instruction selection algorithm is global, which means it can look upwards in the dominator tree when matching patterns. A cost model is required to determine when it is a good idea to fold instructions outside the current loop. The same applies to matching instructions with multiple uses.

Ramifications

This proposal goes a lot further than simply implementing the same instruction selector on a new intermediate representation. It's worthwhile to point out some of the effects of the proposed design.

More legal types

The new definition of a legal type is very permissive, and there are going to be many more legal types in most targets. It is also worth noting that the legality of a type is a function of the type's bit size only. In other words, if f64 is a legal type, so is i64, v2f32, and even v64i1. On the ARM target, for example, these types would be legal:

All 8-bit types via ldrb/strb to GPR. (i8, v1i8, v2i4, v4i2, v8i1)
All 16-bit types via ldrh/strh to GPR. (i16, f16, v1i16, v2i8, ...)
All 32-bit types via ldr/str to GPR and vldr/vstr to SPR.
All 64-bit types via ldrd/strd to GPRPair and vldr/vstr to DPR.
All 128-bit types via vld1/vst1 to DPair.
All 192-bit types via vld1/vst1 to DTriple.
All 256-bit types via vld1/vst1 to DQuad.
This larger set of legal types also makes it easier to handle things like extractelement <8 x i8> which currently produces an illegal type and is thus obfuscated by the type legalizer.

An i8 live range could exist in an ARM function as long as the value is only loaded and stored. We could even use a register class with a single-byte spill size. Different spill sizes for the same registers is already supported. That is how the x86 target holds 32-bit floats in the 128-bit xmm registers. If the i8 value has any illegal arithmetic instructions, the legalizer will expand it to i32, and the expansion should probably be propagated to other basic blocks and threaded through phis.

Most of CodeGenPrepare can go away

The instruction selector will be able to match patterns across basic blocks by looking up the dominator tree. This means that CodeGenPrepare no longer needs to sink GEPs and compares down to their uses. CodeGenPrepare has a lot of addressing mode matching code which can simply be deleted.

Other transformations that currently live in CodeGenPrepare could probably be moved to the MI representation, although it isn't quite clear if we would want to do that.

Better modeling of ISA concepts

Try compiling this function for SPARC64:

float f(int *p) {
  return *p;
}
It's a simple load and float-to-int conversion which produces this IR:

define float @f(i32* nocapture %p) {
entry:
  %l = load i32* %p, align 4
  %conv = sitofp i32 %l to float
  ret float %conv
}
SelectionDAG generates this SPARC64 assembly:

ld [%i0], %l0
st %l0, [%fp+2043]
ld [%fp+2043], %f0
fitos %f0, %f1
The integer value %l is first loaded into an integer register (%l0) and then bitcasted to a float so it can be used by the fitos instruction which reads an integer in%f0 and writes its float result to %f1. The SPARC target is essentially creating a DAG representing this IR:

define float @f(i32* nocapture %p) {
entry:
  %l = load i32* %p, align 4
  %l2 = bitcast i32 %l to float
  %conv = sitofp float %l2 to float
  ret float %conv
}
The bitcast instruction really represents a cross-bank register copy, and the sitofp instruction no longer typechecks; it really does read an integer from a floating point register, but SelectionDAG has no way of representing that. The type labels are no longer representing types, they are purely used to select register banks. The %l2 is not a float, and it isn't being interpreted as a float by the fitos instruction. It's an integer in the floating point register bank.

By modeling register banks better, we don't have to lie to the instruction selector or depend on its type checking being defunct, and we get better results for free. Initially, the MI builder would produce code like this, with default register banks labels derived from value types:

define float @f(i32* nocapture %p) {
entry:
  %l:IntBank = load i32* %p, align 4
  %conv:FloatBank = sitofp i32 %l to float
  ret float %conv
}
The legalizer recognizes that SPARC only supports sitofp in the FloatBank register banks, so it inserts a cross-bank copy:

define float @f(i32* nocapture %p) {
entry:
  %l:IntBank = load i32* %p, align 4
  %l2:FloatBank = COPY %l
  %conv:FloatBank = sitofp i32 %l2 to float
  ret float %conv
}
Finally, the bank selection pass can eliminate the cross-bank copy by moving the %l value to the floating point register bank:

define float @f(i32* nocapture %p) {
entry:
  %l:FloatBank = load i32* %p, align 4
  %conv:FloatBank = sitofp i32 %l to float
  ret float %conv
}
Now we have an i32 load into the floating point register bank which is fine, the instruction selector simply has a pattern that matches any 32-bit load targeting the floating point bank, and we get this code:

ld [%i0], %f0
fitos %f0, %f1
More tuned targets like ARM already has peephole patterns that handle code like this, but it shouldn't be necessary to manually fix all these cases. The ISA fully supports integers in the 'floating point' register banks, so the instruction selector should be able to model that as a first class concept.

Fragile legal types

The new model allows legal types with very few legal operations, and this creates extra challenges for the legalizer. Some legal types are 'fragile' in the sense that it can be beneficial to split even legal operations to avoid too many cross bank copies. Consider this ARM program:

void dot(uint64_t *a, uint64_t *p, uint64_t *q, unsigned n) {
  uint64_t s = 0;
  while (n--)
    s += *p++ * *q++;
  *a = s;
}
which produces this IR after the MI builder has expanded GEPs: (I am borrowing LLVM IR syntax to describe the MI IR.)

define void @dot(i32 %a, i32 %p, i32 %q, i32 %n) {
entry:
  %sum0:i64,VecBank = const i64 0
  %cmp1 = icmp eq i32 %n, 0
  br i1 %cmp1, label %exit, label %loop

loop:
  %iv:i32,IntBank = phi i32 [ %iv2, %loop ], [ %n, %entry ]
  %p1:i32,IntBank = phi i32 [ %p2, %loop ], [ %p, %entry ]
  %q1:i32,IntBank = phi i32 [ %q2, %loop ], [ %q, %entry ]
  %sum1:i64,VecBank = phi i64 [ %sum2, %loop ], [ %sum0, %entry ]
  %iv2:i32,IntBank = add i32 %iv, -1
  %p2:i32,IntBank = add i32 %p1, i32 8
  %q2:i32,IntBank = add i32 %q1, i32 8
  %lp:i64,VecBank = load %p1, align 4
  %lq:i64,VecBank = load %q1, align 4
  %prod:i64,VecBank = mul i64 %lp, %lq
  %sum2:i64,VecBank = add i64 %prod, %sum1
  %cmp2 = icmp eq i32 %iv, 0
  br i1 %cmp2, label %exit, label %entry

exit:
  %s:i64,VecBank = phi i64 [ %sum0, %entry ], [ %sum2, %loop ]
  store i64 %s, i32 %a, align 4
  ret void
}
These instructions are all legal on ARM, except for the mul i64 instruction which the legalizer will split and move to the integer register bank:

define void @dot(i32 %a, i32 %p, i32 %q, i32 %n) {
entry:
  %sum0:i64,VecBank = const i64 0
  %cmp1 = icmp eq i32 %n, 0
  br i1 %cmp1, label %exit, label %loop

loop:
  %iv:i32,IntBank = phi i32 [ %iv2, %loop ], [ %n, %entry ]
  %p1:i32,IntBank = phi i32 [ %p2, %loop ], [ %p, %entry ]
  %q1:i32,IntBank = phi i32 [ %q2, %loop ], [ %q, %entry ]
  %sum1:i64,VecBank = phi i64 [ %sum2, %loop ], [ %sum0, %entry ]
  %iv2:i32,IntBank = add i32 %iv, -1
  %p2:i32,IntBank = add i32 %p1, i32 8
  %q2:i32,IntBank = add i32 %q1, i32 8
  %lp:i64,VecBank = load %p1, align 4
  %lq:i64,VecBank = load %q1, align 4

  %lpl:i32,IntBank = extract_element i64 %lp, 0
  %lph:i32,IntBank = extract_element i64 %lp, 1
  %lql:i32,IntBank = extract_element i64 %lq, 0
  %lqh:i32,IntBank = extract_element i64 %lq, 1

  %prodl:i32,IntBank, %prodh0:i32,IntBank = umul_lohi i32 %lpl, %lql
  %prodh1:i32,IntBank = mul i32 %lpl, %lqh
  %prodh2:i32,IntBank = mul i32 %lph, %lql
  %prodh3:i32,IntBank = add i32 %prodh0, %prodh1
  %prodh:i32,IntBank = add i32 %prodh2, %prodh3

  %prod:i64,VecBank = build_pair %prodl, %prodh

  %sum2:i64,VecBank = add i64 %prod, %sum1
  %cmp2 = icmp eq i32 %iv, 0
  br i1 %cmp2, label %exit, label %entry

exit:
  %s:i64,VecBank = phi i64 [ %sum0, %entry ], [ %sum2, %loop ]
  store i64 %s, i32 %a, align 4
  ret void
}
All operations are now legal in their respective register banks, and an -O0 build would probably stop here. However, this code is quite expensive to execute because of the many cross-bank copies, and the bank selector pass will clean that up. First, the two 64-bit loads are moved to the integer register bank where their values are needed. Second, the legal add i64 operation is split so it can also be moved to the integer register bank, and the split is threaded through the%sum1 phi to eliminate all cross-bank copies:

define void @dot(i32 %a, i32 %p, i32 %q, i32 %n) {
entry:
  %sum0l:i32,IntBank = const i32 0
  %sum0h:i32,IntBank = const i32 0
  %cmp1 = icmp eq i32 %n, 0
  br i1 %cmp1, label %exit, label %loop

loop:
  %iv:i32,IntBank = phi i32 [ %iv2, %loop ], [ %n, %entry ]
  %p1:i32,IntBank = phi i32 [ %p2, %loop ], [ %p, %entry ]
  %q1:i32,IntBank = phi i32 [ %q2, %loop ], [ %q, %entry ]
  %sum1l:i32,IntBank = phi i32 [ %sum2l, %loop ], [ %sum0l, %entry ]
  %sum1h:i32,IntBank = phi i32 [ %sum2h, %loop ], [ %sum0h, %entry ]
  %iv2:i32,IntBank = add i32 %iv, -1
  %p2:i32,IntBank = add i32 %p1, i32 8
  %q2:i32,IntBank = add i32 %q1, i32 8
  %lp:i64,IntBank = load %p1, align 4
  %lq:i64,IntBank = load %q1, align 4

  %lpl:i32,IntBank = extract_element i64 %lp, 0
  %lph:i32,IntBank = extract_element i64 %lp, 1
  %lql:i32,IntBank = extract_element i64 %lq, 0
  %lqh:i32,IntBank = extract_element i64 %lq, 1

  %prodl:i32,IntBank, %prodh0:i32,IntBank = umul_lohi i32 %lpl, %lql
  %prodh1:i32,IntBank = mul i32 %lpl, %lqh
  %prodh2:i32,IntBank = mul i32 %lph, %lql
  %prodh3:i32,IntBank = add i32 %prodh0, %prodh1
  %prodh:i32,IntBank = add i32 %prodh2, %prodh3

  %sum2l:i32,IntBank, carry = addc i32 %prodl, %sum1l
  %sum2h:i32,IntBank = adde %prodh, %sum1h

  %cmp2 = icmp eq i32 %iv, 0
  br i1 %cmp2, label %exit, label %entry

exit:
  %sl:i32,IntBank = phi i32 [ %sum0l, %entry ], [ %sum2l, %loop ]
  %sh:i32,IntBank = phi i32 [ %sum0h, %entry ], [ %sum2h, %loop ]
  %s:i64,IntBank = build_pair %sl, %sh
  store i64 %s, i32 %a, align 4
  ret void
}
The bank selection pass will need to weigh the cost of an extra adde instruction against the cost of a cross-bank copy. Sometimes, like in this example, it is preferable to split legal operations. On the other hand, if it weren't for the multiplication, the entire function would run in the vector register bank. A single instruction can cause entire expression webs to migrate.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130808/76eb7d06/attachment.html>


More information about the llvm-dev mailing list