[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

Tue Nov 29 04:02:33 PST 2016

Hi,
Though there exists lots of researches on parallelizing or scheduling
optimization passes, If you open up the time matrices of codegen(llc
-time-passes), you'll find that the most time consuming task is actually
instruction selection(40~50% of time) instead of optimization
passes(10~0%). That's why we're trying to parallelize the
(target-independent) instruction selection process in aid of JIT
compilation speed.

The instruction selector of LLVM is an interpreter that interpret the
MatcherTable which consists of bytecodes generated by TableGen. I'm
surprised to find that the structure of MatcherTable and the interpreter
seems to be suitable for parallelization. So we propose a prototype that
parallelizes the interpreting of OPC_Scope children that are possibly
time-consuming. Here is some quick overview:

We add two new opcodes: OPC_Fork and OPC_Merge. During DAG optimization
process(utils/TableGen/DAGISelMatcherOpt.cpp). OPC_Fork would be added to
the front of scope(OPC_Scope) children which fulfill following conditions:
1.  Amount of opcodes within the child exceed certain threshold(5 in
current prototype).
2. The child is reside in a sequence of continuous scope children which
length also exceed certain threshold(7 in current prototype).
For each valid sequence of scope children, an extra scope child, where
OPC_Merge is the only opcode, would be appended to it(the sequence).

In the interpreter, when an OPC_Fork is encountered inside a scope child,
the main thread would dispatch the scope child as a task to a central
thread pool, then jump to the next child. At the end of a valid "parallel
sequence(of scope children)" an OPC_Merge must exist and the main thread
would stop there and wait other threads to finish.

About the synchronization, read-write lock is mainly used: In each
checking-style opcode(e.g. OPC_CheckSame, OPC_CheckType, except
OPC_CheckComplexPat) handlers, a read lock is used, otherwise, a write lock
is used.

Finally, although the generated code is correct, total consuming time
barely break even with the original one. Possible reasons may be:
1. The original interpreter is pretty fast actually. The thread pool
dispatching time for each selection task may be too long in comparison with
the original approach.
2. X86 is the only architecture which contains OPC_CheckComplexPat that
would modify DAG. This constraint force us to add write lock on it which
would block other threads at the same time. Unfortunately,
OPC_CheckComplexPat is probably the most time-consuming opcodes in X86 and
perhaps in other architectures, too.
3. Too many threads. We're now working on another approach that use larger
region, consist of multiple scope children, for each parallel task for the
sake of reducing thread amount.
4. Popular instructions, like add or sub, contain lots of scope children so
one or several parallel regions exist. However, most of the common
instruction variants(e.g. add %reg1, %reg2) is on "top" among scope
children which would be encountered pretty early. So sometimes threads are
fired, but the correct instruction is actually immediately selected after
that. Thus lots of time is wasted on joining threads.

Here is our working repository and diff with 3.9 release: https://bitbucket.
org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff
<https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff>
I don't think the current state is ready for code reviewing since there is
no significant speedup. But it's very welcome for folks to discuss about
this idea and also, whether current instruction selection approach had
reached its upper bound of speed.(I ignore fast-isel by mean since it
sacrifices too much on quality of generated code. One of our goals is to
boost the compilation speed while keeping the code quality as much as
possible)

Feel free to comment directly on the repo diff above.

About the "region approach" mentioned in the third item of possible reasons
above. It's actually the "dev-region-parallel" branch, but it still has
some bugs on correctness of generated code. I would put more detail about
it if the feedback is sound.

NOTE: There seems to be some serious bugs in concurrent and synchronization
library of old gcc/standard libraries. So it's strongly recommended to use
the latest version of clang to build our work.

B.R
-- 
Bekket McClane
Department of Computer Science,
National Tsing Hua University
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161129/eac7296a/attachment-0001.html>