[llvm-dev] [RFC] Propeller: A framework for Post Link Optimizations

Sriraman Tallam via llvm-dev llvm-dev at lists.llvm.org
Mon Oct 21 22:38:51 PDT 2019


We are going to be at the LLVM developers' meeting for the next two days. We
will get back to you after that.

Sri

On Mon, Oct 21, 2019 at 10:07 PM Maksim Panchenko <maks at fb.com> wrote:

> Hi Sri,
>
>
>
> Thank you for replying to our feedback. 7 out of 12 high-level concerns
> have been answered; 2 of them are fully addressed. The rest are being
> tracked at the following Google doc:
>
>
>
>
> https://docs.google.com/document/d/1jqjUZc8sEoNl6_lrVD6ZkITyFBFqhUOZ6ZaDm_XVbb8/
>
>
>
> To keep the discussion at a high level, I have to reference the LTO vs
> ThinLTO comparison since that appears to be the central theme in your
> response to the feedback.
>
>
>
> Unlike LTO, BOLT does not have to keep the entire program in memory.
> Furthermore, as we have previously mentioned, most of the passes are run in
> parallel, and the performance scales well with the number of CPUs.
>
>
>
> To demonstrate that running BOLT on a hot subset of functions is not just
> speculation, we have prototyped a "Thin" version that optimizes Clang-7 in
> under 15 seconds using less than 4GB of memory. No modifications to the
> linker or compiler were required. And by the way, that appears to be faster
> than just the re-linking phase of Propeller. Larger workloads show similar
> improvements, providing 2x-5x savings over the original processing costs.
>
>
>
> Let me reiterate that current BOLT requires large amounts of memory not
> because of a fundamental limitation, unlike LTO. For us, system memory was
> never a constraint. The runtime performance of the application, not of BOLT
> itself, was the primary goal during development.
>
>
>
> The ThinLTO design solves a real problem and dramatically improves
> compilation time even when building on a single node. ThinLTO results
> provide an "end-to-end build time" comparison to LTO. I've asked you to
> show a similar comparison for Propeller vs. BOLT. I haven't seen the
> results, and I suspect the total overhead will exceed that of even the
> oldest non-parallel version of BOLT.
>
>
>
> One argument I've heard is that BOLT is not taking advantage of the
> distributed build system. That's correct. It does not have to, since it
> does not require rebuilding the application. In "Thin" mode, the overhead
> is similar to that of a regular linker running with a linker script.
>
>
>
> You are right that we do not support debug fission packages. It is
> unimplemented for a simple reason: no one asked for it previously. And as
> we like to say in the open-source community: "patches are welcome."
>
>
>
> Maksim
>
>
>
> P.S. We have updated https://github.com/facebookincubator/BOLT with
> instructions on running BOLT with jemalloc or tcmalloc.
>
>
>
> On 10/18/19, 11:21 AM, "Sriraman Tallam" <tmsriram at google.com> wrote:
>
>
>
> Hello Maksim,
>
>
>
> On Fri, Oct 18, 2019 at 10:57 AM Maksim Panchenko <maks at fb.com> wrote:
>
> Cool. The new numbers look good. If you run BOLT with the jemalloc library
> preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed
> that, compared to the default malloc, it improves the multithreaded
> performance and brings down memory usage significantly.
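>
> For reference, preloading jemalloc when invoking llvm-bolt typically looks
> something like the following sketch (the libjemalloc.so path is an
> assumption and varies by distribution, and the profile file name is a
> placeholder):
>
> $ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
>     llvm-bolt clang-10 -o clang-10.bolt -b profile.yaml \
>     -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort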
>
>
>
> Great, thanks for confirming!  Would you be willing to share specific
> numbers on how significant the reduction in memory with jemalloc is for
> clang?  We double-checked our numbers with the larger benchmarks and we can
> confirm they were *not built with labels*.  One of our large benchmarks,
> search1, is about 5 times the size of clang in terms of text size as
> reported by the size command, and we are seeing a 70G memory overhead on
> it.  Do you have data on the memory consumption of BOLT on larger
> benchmarks with jemalloc?  We are trying to build Chrome with the latest
> BOLT so that we can share the memory overheads and the binaries with you
> for transparency, but we are struggling with disassembly errors. If you
> have data on large benchmarks, we would appreciate it if you could share
> it.
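>
> (The text-size comparison above is based on the output of the binutils
> size tool, roughly along these lines, with placeholder binary names:
>
> $ size clang-10 search1
>
> which prints the text, data, and bss sizes of each binary.)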
>
>
>
> Further, if you have a recipe for using jemalloc with BOLT, please share it
> with us. We could try it out too and share our findings.
>
>
>
> Thanks much,
>
> Sri
>
>
>
>
>
> Thanks,
>
> Maksim
>
>
>
> On 10/17/19, 2:59 PM, "Sriraman Tallam" <tmsriram at google.com> wrote:
>
>
>
>
>
> On Wed, Oct 16, 2019 at 3:52 PM Maksim Panchenko <maks at fb.com> wrote:
>
> Hi Sri,
>
>
>
> I want to clarify one thing before sending a detailed reply: did you
> evaluate BOLT on a Clang built with basic block sections? In the Makefile
> you reference, there are two versions: a “vanilla” one and a default one
> built with function sections. The high overheads you see with BOLT on Clang
> do not match our experience.
>
>
>
> Thanks for pointing that out in the Makefile. We double-checked and noticed
> a bug in our Makefile.  For clang, we noticed that we were running BOLT on
> a binary built with basic block symbols, which seems to affect the memory
> consumption of BOLT.  So, we have re-measured with a recent BOLT and, for
> *full transparency*, we have uploaded the binaries, BOLT's yaml files,
> perf.data files, and the commands so that you can independently verify our
> results and check the binaries.  We have gzipped all the files to keep them
> under the 2G limit for Git LFS; everything is here:
> https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment
> We have run our experiments on a 192G machine with an 18-core Intel CPU.
>
>
>
> We built llvm-bolt from the most recent sources; it is *pristine*, with
> none of our patches, and we have uploaded the binary we used here:
> https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt
> That's a very recent BOLT binary, git
> hash 988a7e8819b882fd14e18d149f8d3f702b134680.
>
>
>
> The directory
> https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/{v1,v2} contains
> two sets of binaries.  The first binary, v1, is a pristine, recent clang-10
> built 2 weeks ago with only -Wl,-q added.  v2 is another clang binary, also
> built with only the additional -q linker flag.  No labels, sections, or any
> other Propeller flags were used to build these clang binaries.  Here is the
> command we are using to optimize with BOLT; all the commands have been
> checked in too.
>
>
>
> You should be able to run llvm-bolt on these binaries now, as all the files
> are provided.  We have also provided the raw perf.data files in case you
> want to convert them independently.
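>
> Converting the raw profile independently would look roughly like the
> following sketch (perf2bolt ships with BOLT; the output file name is a
> placeholder, and the exact flag for emitting the YAML profile format used
> below may vary by BOLT version):
>
> $ perf2bolt -p perf.data -o clang-10.fdata clang-10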
>
>
>
> $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b
> pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+
> -reorder-functions=hfsort -relocs=1 --update-debug-sections
>
>
>
> For version 2, these are the numbers:
>
>
>
> Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40
>
> Maximum resident set size (kbytes): 18742688
>
>
>
> That is 125 seconds and ~18G of RAM.
>
>
>
> For version 1, this hangs; we stopped it after several minutes, with the
> maximum RSS size crossing 50G.  This is most likely just a bug, and you
> should be able to reproduce it.  The binary seems to keep running and
> printing update messages.
>
>
>
> We also measured without --update-debug-sections, with the command:
>
>
>
> $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b
> pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+
> -reorder-functions=hfsort -relocs=1
>
>
>
> For version 1:
>
> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74
>
> Maximum resident set size (kbytes): 14824444
>
>
>
> 93 seconds and ~14G of RAM
>
>
>
> For version 2:
>
> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90
>
> Maximum resident set size (kbytes): 14511912
>
>
>
> Similarly, 91 seconds and ~14G of RAM.
>
>
>
> Now, coming back to the bug in the Makefile, we originally reported ~35G.
> That is *wrong*, since the clang binary used to measure BOLT overheads was
> built with basic block labels.  Our *sincere apologies* for this; it showed
> BOLT as consuming more memory than it actually does for clang.  We
> double-checked the BOLT numbers with the internal benchmark search2 for
> sanity, and that one is built *without any labels* and only with "-Wl,-q".
> We are checking the other large internal benchmarks too.  We cannot
> disclose internal benchmarks, so we will get more large open-source
> benchmarks like Chrome or gcc built with BOLT and share the binaries and
> results so that you can verify independently.
>
>
>
> Thanks
>
> Sri
>
>
>
>
>
> Thanks,
>
> Maksim
>
>
>
> On 10/14/19, 11:44 AM, "llvm-dev on behalf of Sriraman Tallam via
> llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of
> llvm-dev at lists.llvm.org> wrote:
>
>
>
> Hello,
>
>
>
> I wanted to consolidate all the discussions and our final thoughts on the
> concerns raised.  I have attached a document that consolidates them.
>
>
>
> BOLT’s performance gains inspired this work, and we believe BOLT is a great
> piece of engineering.  However, there are build environments where
> scalability is critical and memory limits per process are tight:
>
> * Debug Fission, https://gcc.gnu.org/wiki/DebugFission, was primarily
> invented to achieve scalability and better incremental build times while
> building large binaries with debug information.
>
> * ThinLTO,
> http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html, was
> primarily invented to make LLVM’s full LTO scalable and keep the memory and
> time overheads low.  ThinLTO has enabled much broader adoption of whole
> program optimization by making it non-monolithic.
>
> * For Chromium builds,
> https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolchain/concurrent_links.gni,
> the linker process memory is set to 10GB with ThinLTO.  It was 26GB with
> full LTO before that, and individual processes will run out of memory
> beyond that.
>
> * Here,
> https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistributedBuildSystemAtGoogleScale.pdf,
> a distributed build system at Google scale is shown where 5 million binary
> and test builds are performed every day on several thousand machines, each
> with a limit of 12G of memory per process and a 15-minute time-out on
> tests. Memory overheads of 35G (clang) are well above these thresholds.
>
> We have developed Propeller, in the spirit of ThinLTO, so that performance
> gains similar to BOLT's can be obtained in such environments.
>
>
>
> Thanks
>
> Sri
>
>
>
>
>
> On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>
>
>
>
> On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Is there large value from deferring the block ordering to link time? That
> is, does the block layout algorithm need to consider global layout issues
> when deciding which blocks to put together and which to relegate to the
> far-away part of the code?
>
>
>
> Or, could the Propeller-optimized compile step instead split each function
> into only 2 pieces -- one containing an "optimally-ordered" set of hot
> blocks from the function, and another containing the cold blocks? The
> linker would have less flexibility in placement, but maybe it doesn't
> actually need that flexibility?
>
>
>
> Apologies if this is obvious for those who actually know what they're
> talking about here. :)
>
>
>
> It is a fair question.
>
>
>
> We believe the flexibility to do fine-grained layout in a whole-program
> context is important. Post-link optimization is aimed at getting as much
> performance improvement as possible (it is usually applied on top of
> ThinLTO+PGO), so the framework is designed to enable it.
>
>
>
> In particular, it allows the linker to stitch together hot basic-block
> traces from different functions. It also allows hot trace duplication
> across procedure boundaries (a kind of interprocedural tail duplication).
> Besides, code alignment decisions to minimize branch mispredictions may
> require global context (e.g., two conflicting branches residing in two
> different functions).  Other micro-arch-specific optimizations to improve
> processor front-end throughput may also require global context.
>
>
>
> It is conceivable to have an option to control the level of granularity at
> the possible cost of performance.
>
>
>
> thanks,
>
>
>
> David
>
>
>
>
>
>
>
> On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <rafaelauler at fb.com> wrote:
>
> You’re correct, except that, in Propeller, CFI duplication happens for
> every basic block, as it operates under the conservative assumption that a
> block can be put anywhere by the linker. That’s significant bloat that is
> not cleaned up later. So, at link time, if N blocks from the same function
> are contiguous in the final layout, as should happen most of the time for
> any sane BB order, we would have several FDEs for a region that only needs
> one. The bloat goes into the final binary (a lot more FDEs, specifically,
> one FDE per basic block).
>
> BOLT will only split a function into two parts, and only if it has a
> profile. Most of the time, a function is not split. It also has an option
> not to split at all. For internally reordered basic blocks of a given
> function, it has CFI deduplication logic (it will interpret and build the
> CFI states for each block and rewrite the CFIs in a way that uses the
> minimum number of instructions to encode the states for each block).
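>
> One rough way to quantify the FDE bloat on a finished binary is to count
> the FDEs in its frame information; a sketch (the binary name is a
> placeholder):
>
> $ readelf --debug-dump=frames a.out | grep -c 'FDE'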
>
>
>
> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of James Y
> Knight via llvm-dev <llvm-dev at lists.llvm.org>
> *Reply-To: *James Y Knight <jyknight at google.com>
> *Date: *Wednesday, October 2, 2019 at 1:59 PM
> *To: *Maksim Panchenko <maks at fb.com>
> *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>
> *Subject: *Re: [llvm-dev] [RFC] Propeller: A framework for Post Link
> Optimizations
>
>
>
> I'm a bit confused by this subthread -- doesn't BOLT have the exact same
> CFI bloat issue? From my cursory reading of the Propeller doc, the CFI
> duplication is _necessary_ to represent discontiguous functions, not
> anything particular to the way Propeller happens to generate those
> discontiguous functions.
>
>
>
> And emitting discontiguous functions is a fundamental goal of this, right?
>
>
>
> On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Thanks for clarifying. This means that once you move to the next basic
> block (or any other basic block in the function) you have to execute an
> entirely new set of CFI instructions, except for the common CIE part. While
> this is indeed not as bad, on average, the overall active memory footprint
> will increase.
>
>
>
> Creating one FDE per basic block means that .eh_frame_hdr, an allocatable
> section, will be bloated too. This will increase the FDE lookup time. I
> don’t see .eh_frame_hdr being mentioned in the proposal.
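>
> The growth of both sections is straightforward to check on a built binary;
> a sketch (the binary name is a placeholder):
>
> $ readelf -SW a.out | grep -E '\.eh_frame(_hdr)?'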
>
>
>
> Maksim
>
>
>
> On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at google.com>
> wrote:
>
>
>
>
>
>
>
> On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> *Pessimization/overhead for stack unwinding used by system-wide profilers
> and for exception handling*
>
> Larger CFI programs put an extra burden on unwinding at runtime as more CFI
> (and thus native) instructions have to be executed. This will cause more
> overhead for any profiler that records stack traces, and, as you correctly
> note
> in the proposal, for any program that heavily uses exceptions.
>
>
>
> The number of CFI instructions that have to be executed when unwinding any
> given stack stays the same. The CFI instructions for a function have to be
> duplicated in every basic block section, but when performing unwinding only
> one such set is executed -- the copy for the current basic block.
> However, this copy contains precisely the same CFI instructions as the ones
> that would have to be executed if there were no basic block sections.
>
>
>
> --
>
> Krzysztof Pszeniczny
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>