[PATCH] Fast-ISel state machine

Fri Apr 17 09:49:29 PDT 2015

Hi all

I’ve been working on improving fast-isel coverage.  Our current fast-isel model involves auto-generating a bunch of C++ code from tablegen, but then hand writing a significant proportion in C++ to get coverage and performance.

I’ve ported the state machine used by selection DAG to fast-isel.  This is able to walk IR in much the same way that the SD machine walks nodes, and produces MIs.  This is able to handle predicates, transforms, and complex patterns, all of which are not handled by the current fast-isel tablegen emitter.

There are a few different pieces of this work:
(1) Extend tablegen SDNode to take the IR ValueID of the thing we are matching
(2) Extend tablegen PatFrag to take fast-isel versions of the predicate code and the transform code
(3) Teach the tablegen DAG emitter to use these IR constructs where available, and when emitting for fast-isel
(4) The state machine itself, which is just a port of the SD one, but with SDValue->Value* and a bunch of other changes like handling register class constraining.
(5) Porting the complex patterns, predicates, and transforms from SD to fast-isel.  This is mostly target specific code in the targets own FastISel.cpp file, and td files.

As my test case, i took a bitcode which contains llc itself compiled for AArch64.  All the target specific work i’ve done here is for AArch64.  It involves writing about 600 LOC for the complex patterns, and about 300 LOC to handle predicates/transforms.  This is vs the 5100 LOC AArch64FastISel.cpp currently takes.

To measure performance, i tried to see what it would take to get from SD, all the way to the currently extremely good AArch64 fast-isel implementation, and see how this new code could help us either get there quicker, or even improve what we have.

The metrics are:
- Time to run ISel
- Number of machine instrs printed by asm-printer
- BBs selected entirely by fast-isel
- Number of instrs fast-isel selected

And the runs I considered were:
(a) Stock SelectionDAG.  This is prior to anyone trying to write or run fast-isel
(b) Basic fast-isel (i.e., calls selectOperator), and has no hand-written fast-isel code
(c) Basic fast-isel + hand written code for return and branch (this is 300 LOC on AArch64)
(d) The new stack machine then the above code (this is about 900 LOC in addition)
(e) Current fast-isel without the new stack machine (this is 5100 LOC in tree currently)
(f) Current fast-isel falling back to the stack machine when it fails

Time to run ISel:
(a) 27.6
(b) 25.0
(c) 20.0
(d) 14.1
(e) 7.3
(f) 7.7

Number of machine instrs printed by asm-printer:
(a) 1912570
(b) 4685108
(c) 4321009
(d) 4598457
(e) 4230855
(f) 4231056

BBs selected entirely by fast-isel:
(a) N/A
(b) 63794
(c) 122225
(d) 266476
(e) 329909
(f) 330010

Number of instrs fast-isel selected:
(a) N/A
(b) 292623
(c) 638551
(d) 1389476
(e) 1471200
(f) 1474131

Apologies if there’s a better way to present that.  I don’t want to be presumptuous and put a spreadsheet not everyone can open in an email.

The interesting points to take away are that going from (c) to (d), we move from a backend with basic fast-isel support, to the new one.  This results in compile time in ISel dropping 30%, an increase in # instructions generated (i’m investigating this), over 2x the number of BBs entirely handled in fast-isel, and over 2x the number of instructions generated by fast-isel.

The number of BBs selected entirely is where almost all the compile time improvement comes from.  Fast-ISel gets the biggest wins in compile time when we never fall back to SelectionDAG.  Given that this patch improves fully selected BBs by 2x, its not surprising to see about 2x from compile time as a side-effect.

(e) to (f) is also interesting.  This is what happens if AArch64 uses the current path, but then adds the state machine as a fall-back.  We select about 100 more BBs in fast-isel, and about 3000 more instructions, but compile time actually regresses a little.  I haven’t yet spent much time tuning the state machine so i think i can recover this loss.  More importantly though, the state machine is optional and the backend doesn’t have to call it if it doesn’t want to.  So the code owner can make the call as to whether its worth it or not.  With a less tuned implementation that AArch64, its likely still a win to use the new code as a fallback.

So, from here i’d like to see if I can get this code in tree.  The code is entirely optional.  No backends have to change.  If no-one calls the code then it’ll be dead stripped, although we need at least one user at some point or its just dead code.

The AArch64 changes are a demonstration and its up to the code owners there if they want this or not.  The changes to TargetSelectionDAG.td and tablegen itself are necessary for this to work on any other targets.  I’m happy to discuss what the changes are in more detail, and whatever pieces people are happy with being landed (see the changes to ‘def fma’ for some of the more controversial tablegen fixes to get this to work).

Comments welcome.

Cheers,
Pete

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fast-isel-state-machine.diff
Type: application/octet-stream
Size: 107854 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150417/35caa5e5/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fast-isel-aarch64.diff
Type: application/octet-stream
Size: 50189 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150417/35caa5e5/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fast-isel-tablegen.diff
Type: application/octet-stream
Size: 47736 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150417/35caa5e5/attachment-0002.obj>