[LLVMdev] [LLVM Dev] [Discussion] Function-based parallel LLVM backend code generation

Tue Jul 16 19:48:37 PDT 2013

-----Original Message-----
From: Xinliang David Li [mailto:xinliangli at gmail.com] 
Sent: Wednesday, July 17, 2013 4:18 AM
To: Wan, Xiaofei
Cc: LLVM Developers Mailing List (llvmdev at cs.uiuc.edu)
Subject: Re: [LLVMdev] [LLVM Dev] [Discussion] Function-based parallel LLVM backend code generation

On Tue, Jul 16, 2013 at 3:33 AM, Wan, Xiaofei <xiaofei.wan at intel.com> wrote:
> Hi, community:
>
> To meet a business need, I want to enable "function-based parallel code generation" to speed up the compilation of a single module. Please see the design details below and give feedback on the following points. Thanks!
> 1. Is this idea the proper solution for my requirement?
> 2. This new feature will be enabled by llc -thd=N and has no impact on the original llc when -thd=1.
> 3. Can this new llc feature be accepted by the community and merged into the LLVM code tree?
>
> Patches
> The patch is divided into four separate parts; the all-in-one patch can be found here:
> http://llvm-reviews.chandlerc.com/D1152
>
> Design
> https://docs.google.com/document/d/1QSkP6AumMCAVpgzwympD5pI3btPJt4SRgjY-vhyfySg/edit?usp=sharing
>
>
> Background
> 1. Our business needs to compile C/C++ source files into LLVM IR and link them into one big BC file; the big BC file is then compiled into binary code on different arch/target devices.
> 2. Backend code generation is a time-consuming step that happens on the target device, which makes it important for user experience.
> 3. make -j or other file-based parallelism can't help here since there is only one big BC file; function-based parallel LLVM backend code generation is a good way to reduce compilation time because it can fully utilize multiple cores.
>
> Overall design strategy and goal
> 1. Generate exactly the same binary as the single-threaded output
> 2. No impact on single-thread performance and conformance
> 3. Little impact on the LLVM code infrastructure
>
> Current status and test result
> 1. Parallel llc generates the same code as single-threaded llc, as checked with "objdump -d"; it passes a 10-hour stress test for all performance benchmarks
> 2. Parallel llc gives a ~2.9X performance gain with 4 threads on a Xeon server

Ignoring FE time, which can be fully parallelized, and assuming 10% of compile time is spent in serial module passes and 25% in CGSCC passes, at least 35% of the time stays serial, so the maximum speedup that can be gained from function-level parallelism is less than 3x.  Even adding support for parallel compilation of the CG leaves in the CGSCC passes won't help much -- the percentage of leaf functions is < 30% in the large apps I have seen.

The module-based parallelism proposed by Shuxin has a maximum speedup of 10x, assuming body cloning does not add much overhead and a build farm with hundreds or thousands of nodes is used.
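
The two bounds above can be checked with Amdahl's law. The sketch below is only an illustration: the function name is hypothetical, the fractions are the assumed figures quoted above rather than measurements, and reading the 10x bound as "only the 10% of module passes stays serial" is an assumption about how that number was derived.

```python
def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on speedup with unlimited workers (Amdahl's law)."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in (0, 1]")
    return 1.0 / serial_fraction


# Function-level parallelism: module passes (10%) and CGSCC passes (25%)
# are assumed to stay serial.
function_level = amdahl_limit(0.10 + 0.25)

# Module-level parallelism: assume only the 10% of module passes stays serial.
module_level = amdahl_limit(0.10)

print(f"function-level bound: {function_level:.2f}x")
print(f"module-level bound:   {module_level:.2f}x")
```

Both numbers are ceilings for an unlimited number of workers with no overhead for threading or body cloning; real speedups would be lower.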

[Xiaofei] For SPEC CPU2006, my VTune measurements show that function passes consume >90% of the total time in llc (LTO is not enabled). Here I only consider llc without LTO, so the maximum parallelism depends on how many threads are started.
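
As a rough cross-check of these figures, the finite-thread form of Amdahl's law relates the parallel fraction, the thread count, and the speedup. This is a sketch under an idealized model (perfect load balance, no synchronization cost); the function names are hypothetical, and 0.90 is used as a stand-in for the ">90%" figure.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup when `parallel_fraction` of the work is spread over `workers`."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def implied_parallel_fraction(speedup: float, workers: int) -> float:
    """Invert Amdahl's law: the parallel fraction that would yield `speedup`."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


# 90% of llc time in function passes, 4 threads (figures from this thread).
predicted = amdahl_speedup(0.90, 4)

# Parallel fraction that an observed ~2.9x gain on 4 threads would correspond to.
implied = implied_parallel_fraction(2.9, 4)

print(f"predicted speedup at 4 threads: {predicted:.2f}x")
print(f"parallel fraction implied by 2.9x: {implied:.2f}")
```

Under this model a 90% parallel fraction gives about 3.08x on 4 threads, and the reported ~2.9x corresponds to an effective parallel fraction of about 0.87, so the two figures are roughly consistent with each other.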

David

>
>
> Thanks
> Wan Xiaofei
>