[llvm-dev] [RFC] Parallelizing (Target-Independent) Instruction Selection

Fri Dec 2 17:39:33 PST 2016

On 11/29/2016 05:01 PM, Bekket McClane via llvm-dev wrote:
>
>> On Nov 30, 2016, at 5:16 AM, Mehdi Amini <mehdi.amini at apple.com
>> <mailto:mehdi.amini at apple.com>> wrote:
>>
>>>
>>> On Nov 29, 2016, at 1:14 PM, Mehdi Amini <mehdi.amini at apple.com 
>>> <mailto:mehdi.amini at apple.com>> wrote:
>>>
>>>
>>>> On Nov 29, 2016, at 4:02 AM, Bekket McClane via llvm-dev 
>>>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>>>>
>>>> Hi,
>>>> Although there is a lot of research on parallelizing or
>>>> scheduling optimization passes, if you look at the timing report of
>>>> codegen (llc -time-passes), you'll find that the most time-consuming
>>>> task is actually instruction selection (40~50% of time) rather than
>>>> the optimization passes (10~0%). That's why we're trying to parallelize
>>>> the (target-independent) instruction selection process to improve
>>>> JIT compilation speed.
>>>
>>>
>>> How much of this 40-50% is spent in the matcher table? I thought most
>>> of the overhead was inherent to SelectionDAG?
>>> Also, why take such a fine-grained approach instead of trying to
>>> perform instruction selection in parallel across basic blocks or
>>> functions?
>>>
>>> I suspect you won’t gain much for too much added complexity with 
>>> this approach.
>>
>> I forgot to add: did you try to enable fast-isel instead? In the 
>> context of a JIT this is a quite common approach.
>
> Well, as I mentioned at the bottom of my first email, one of our goals
> is to boost compilation speed while preserving the quality of the
> generated code as much as possible, and fast-isel doesn't do
> very well on generated-code quality.
> I guess users of fast-isel would use a multi-tier compilation model and
> take fast-isel as the baseline compiler.
JFYI, if you haven't tested the quality of FastISel code on your
platform, do.  FastISel frequently generates quite decent code for some
of the architectures we support.  I've heard of folks using it for high
tier JITs.  (I don't personally do so, but it's on my list of things to
re-evaluate at some point.)
>
> B.R.
> McClane
>>
>> —
>> Mehdi
>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>> The instruction selector of LLVM is an interpreter that interprets
>>>> the MatcherTable, which consists of bytecode generated by TableGen.
>>>> I was surprised to find that the structure of the MatcherTable and the
>>>> interpreter seems suitable for parallelization. So we propose
>>>> a prototype that parallelizes the interpretation of OPC_Scope
>>>> children that are potentially time-consuming. Here is a quick overview:
>>>>
>>>> We add two new opcodes: OPC_Fork and OPC_Merge. During the DAG
>>>> optimization process (utils/TableGen/DAGISelMatcherOpt.cpp),
>>>> OPC_Fork is added to the front of scope (OPC_Scope) children
>>>> that fulfill the following conditions:
>>>> 1. The number of opcodes within the child exceeds a certain
>>>> threshold (5 in the current prototype).
>>>> 2. The child resides in a sequence of contiguous scope children
>>>> whose length also exceeds a certain threshold (7 in the current
>>>> prototype).
>>>> For each valid sequence of scope children, an extra scope child,
>>>> in which OPC_Merge is the only opcode, is appended to the
>>>> sequence.
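
The two conditions above can be sketched as a small standalone pass. This is
only an illustration, not the code in the repository: ScopeChild,
markForkCandidates, and the two constants are hypothetical names. It assumes
that a "sequence" means a run of consecutive children that each satisfy
condition 1, and that "exceeds" is a strict comparison.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary of one OPC_Scope child.
struct ScopeChild {
  std::size_t NumOpcodes; // number of opcodes inside the child
  bool Fork;              // true if an OPC_Fork would be prepended
};

// Thresholds quoted from the prototype description.
constexpr std::size_t MinOpcodes = 5;   // condition 1: child size must exceed this
constexpr std::size_t MinRunLength = 7; // condition 2: run length must exceed this

// Marks every child that would get an OPC_Fork and returns, for each valid
// run, the index of its last child (after which the OPC_Merge-only child
// would be appended).
std::vector<std::size_t> markForkCandidates(std::vector<ScopeChild> &Children) {
  std::vector<std::size_t> MergeAfter;
  const std::size_t N = Children.size();
  std::size_t I = 0;
  while (I < N) {
    if (Children[I].NumOpcodes <= MinOpcodes) {
      ++I; // too small to be worth a task
      continue;
    }
    // Extend the run of consecutive large-enough children (assumed meaning
    // of "sequence of contiguous scope children").
    std::size_t J = I;
    while (J < N && Children[J].NumOpcodes > MinOpcodes)
      ++J;
    if (J - I > MinRunLength) {
      for (std::size_t K = I; K < J; ++K)
        Children[K].Fork = true;
      MergeAfter.push_back(J - 1);
    }
    I = J;
  }
  return MergeAfter;
}
```

With these thresholds, a run of eight children with six opcodes each would be
marked, while a run of exactly seven would not.
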
>>>>
>>>> In the interpreter, when an OPC_Fork is encountered inside a scope
>>>> child, the main thread dispatches the scope child as a task to
>>>> a central thread pool, then jumps to the next child. At the end of a
>>>> valid "parallel sequence" (of scope children) an OPC_Merge must
>>>> exist; the main thread stops there and waits for the other threads
>>>> to finish.
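
A rough sketch of the fork/merge control flow described above. It swaps the
central thread pool for std::async to stay self-contained, and it assumes that
when several forked children match, the earliest one wins (to mirror the
priority order of the sequential interpreter). Child and runScope are
hypothetical names, not part of the real MatcherTable interpreter.

```cpp
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for one scope child.
struct Child {
  enum Kind { Plain, Fork, Merge } K;
  std::function<bool()> Match; // returns true if the pattern matched; unused for Merge
};

// Walks the children in order and returns the index of the first child that
// matches, or -1 if none does.
int runScope(const std::vector<Child> &Children) {
  std::vector<std::pair<int, std::future<bool>>> Pending;
  for (int I = 0; I < static_cast<int>(Children.size()); ++I) {
    const Child &C = Children[I];
    if (C.K == Child::Fork) {
      // Dispatch the child as a task and move on to the next child.
      Pending.emplace_back(I, std::async(std::launch::async, C.Match));
      continue;
    }
    if (C.K == Child::Merge) {
      // Wait for every forked child of this parallel sequence.
      int Winner = -1;
      for (auto &P : Pending) {
        bool Matched = P.second.get();
        if (Matched && Winner < 0)
          Winner = P.first; // assumed policy: earliest matching child wins
      }
      Pending.clear();
      if (Winner >= 0)
        return Winner;
      continue;
    }
    if (C.Match()) // ordinary child: evaluated on the main thread
      return I;
  }
  return -1;
}
```

Note that the merge step joins every task even when an early child has already
matched, which is one place where the joining cost discussed later in this
message would show up.
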
>>>>
>>>> As for synchronization, a read-write lock is mainly used: in the
>>>> handler of each checking-style opcode (e.g. OPC_CheckSame,
>>>> OPC_CheckType, but not OPC_CheckComplexPat), a read lock is taken;
>>>> otherwise, a write lock is taken.
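
The locking discipline above might look roughly like the following, using
C++17's std::shared_mutex as the read-write lock. This is an assumption made
for illustration; the prototype may use a different lock type. SharedDAG,
checkType, and rewriteNode are hypothetical names.

```cpp
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for the DAG state shared by all matcher threads.
struct SharedDAG {
  int NodeType = 0;
  int Version = 0; // bumped on every mutation
  mutable std::shared_mutex Lock;
};

// Checking-style handler: it only reads, so any number of threads may hold
// the shared (read) lock at the same time.
bool checkType(const SharedDAG &DAG, int Expected) {
  std::shared_lock<std::shared_mutex> Guard(DAG.Lock);
  return DAG.NodeType == Expected;
}

// Mutating handler (for example, a complex pattern that rewrites the DAG):
// it takes the exclusive (write) lock, blocking all readers and writers.
void rewriteNode(SharedDAG &DAG, int NewType) {
  std::unique_lock<std::shared_mutex> Guard(DAG.Lock);
  DAG.NodeType = NewType;
  ++DAG.Version;
}
```

Under this scheme, any handler that takes the write lock serializes every
other thread for as long as it runs.
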
>>>>
>>>> Finally, although the generated code is correct, the total
>>>> time barely breaks even with the original. Possible reasons may be:
>>>> 1. The original interpreter is actually pretty fast. The thread
>>>> pool dispatch time for each selection task may be too long in
>>>> comparison with the original approach.
>>>> 2. X86 is the only architecture whose OPC_CheckComplexPat
>>>> may modify the DAG. This constraint forces us to take a write lock
>>>> there, which blocks all other threads in the meantime.
>>>> Unfortunately, OPC_CheckComplexPat is probably the most
>>>> time-consuming opcode on X86 and perhaps on other architectures, too.
>>>> 3. Too many threads. We're now working on another approach that uses
>>>> a larger region, consisting of multiple scope children, for each
>>>> parallel task in order to reduce the number of threads.
>>>> 4. Popular instructions, like add or sub, contain lots of scope
>>>> children, so one or several parallel regions exist. However, most of
>>>> the common instruction variants (e.g. add %reg1, %reg2) are at the
>>>> "top" among the scope children and are encountered pretty early. So
>>>> sometimes threads are fired, but the correct instruction is
>>>> actually selected immediately after that. Thus a lot of time is
>>>> wasted on joining threads.
>>>>
>>>> Here is our working repository and the diff against the 3.9 release:
>>>> https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff 
>>>> <https://bitbucket.org/mshockwave/hydra-llvm/branches/compare/master%0D3.9-origin#diff>
>>>> I don't think the current state is ready for code review since
>>>> there is no significant speedup. But folks are very welcome to
>>>> discuss this idea, and also whether the current instruction
>>>> selection approach has reached its upper bound on speed. (I ignore
>>>> fast-isel on purpose since it sacrifices too much quality of the
>>>> generated code. One of our goals is to boost compilation speed
>>>> while keeping the code quality as much as possible.)
>>>>
>>>> Feel free to comment directly on the repo diff above.
>>>>
>>>> About the "region approach" mentioned in the third item of the
>>>> possible reasons above: it lives in the "dev-region-parallel" branch,
>>>> but it still has some correctness bugs in the generated code. I will
>>>> provide more detail about it if the feedback is positive.
>>>>
>>>> NOTE: There seem to be some serious bugs in the concurrency and
>>>> synchronization libraries of old gcc/standard libraries, so it's
>>>> strongly recommended to use the latest version of clang to build
>>>> our work.
>>>>
>>>> B.R
>>>> --
>>>> Bekket McClane
>>>> Department of Computer Science,
>>>> National Tsing Hua University
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>

