<div dir="ltr"><div dir="ltr"><span id="gmail-m_387428305123375892gmail-docs-internal-guid-cf3c1e3e-7fff-7423-ec53-b84ca96a1064"><p style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><font color="#000000" face="Arial"><span style="font-size:14.6667px;white-space:pre-wrap">Below, we discuss our plan for handling Intel's JCC mitigation in Propeller.</span></font></p><p style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><font color="#000000" face="Arial"><span style="font-size:14.6667px;white-space:pre-wrap"><br></span></font></p><p dir="ltr" style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><span style="background-color:transparent;color:rgb(0,0,0);font-family:Arial;font-size:11pt;white-space:pre-wrap">TL;DR: By computing basic block groupings early, the compiler can form larger clusters of basic blocks (each cluster in its own section), which allows Propeller to simply reuse the assembler’s mitigation. Our experiments show that JCC mitigation causes only a 0.2% slowdown with Propeller, compared to the 0.6% slowdown incurred by the vanilla configuration.</span><br></p><br><p dir="ltr" style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">A slightly longer summary:</span></p><br><ul style="margin-top:0px;margin-bottom:0px"><li style="list-style-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">We evaluated a Propeller 
prototype that reuses the existing assembler mitigation in LLVM, -mbranches-within-32B-boundaries, which currently uses only NOPs for mitigation.</span></p></li><li style="list-style-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">With some changes, Propeller is able to reuse the existing assembler mitigation. To do this, we form large basic block clusters (sections containing multiple basic blocks) in the compiler by computing the basic block layout earlier.</span></p></li><li style="list-style-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">The vanilla clang benchmark (no Propeller) regresses by ~0.6% with this flag.</span></p></li><li style="list-style-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">With Propeller, the exact same flag regresses clang by only ~0.2%, reducing the total speedup from 7.8% to 7.6%.</span></p></li><li 
style="list-style-type:disc;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p style="line-height:1.38;text-align:justify;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">For problems of this kind, the linker is generally the best place to implement the solution. However, for this particular problem, it appears that the assembler's mitigation is good enough when combined with Propeller.</span></p></li></ul><div style="text-align:justify"><span id="gmail-m_387428305123375892gmail-docs-internal-guid-25f83311-7fff-283d-32d5-885c1d4f8044"><h1 dir="ltr" style="line-height:1.38;margin-top:20pt;margin-bottom:6pt"><span style="font-size:20pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Background</span></h1><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">The </span><a href="https://www.intel.com/content/www/us/en/support/articles/000055650/processors.html" style="text-decoration-line:none" target="_blank"><span style="font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration-line:underline;vertical-align:baseline;white-space:pre-wrap">JCC erratum</span></a><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> is a CPU bug affecting Skylake processors which 
results in unpredictable behaviour under complex micro-architectural states involving the Decoded I-cache, specifically, when executing branches which cross a cache line.</span></p><br><h2 dir="ltr" style="line-height:1.38;margin-top:18pt;margin-bottom:6pt"><span style="font-size:16pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">MicroCode Update (MCU) Mitigation</span></h2><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">The CPU avoids this bug by bypassing the Decoded ICache for branches crossing 32B boundaries. This sacrifices some performance (0-4%) in return for correctness. The compiler can alleviate this effect by aligning the code such that branches do not cross a 32B boundary. 
There are two ways that the compiler can do this:</span></p><ol style="margin-top:0px;margin-bottom:0px"><li dir="ltr" style="list-style-type:decimal;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">Inserting NOP instructions</span></p></li><li dir="ltr" style="list-style-type:decimal;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">Inserting prefixes for instructions</span></p></li></ol><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">The current solution shipped with clang-10 (under </span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">-mbranches-within-32B-boundaries</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">) aligns every function to 32B and inserts NOPs between instructions. 
Our experiment shows enabling this option results in 0.6% performance degradation for Clang. There have been some efforts to improve this using instruction prefixes (</span><a href="https://reviews.llvm.org/D72225" style="text-decoration-line:none" target="_blank"><span style="font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration-line:underline;vertical-align:baseline;white-space:pre-wrap">https://reviews.llvm.org/D72225</span></a><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">, </span><a href="https://reviews.llvm.org/D75268" style="text-decoration-line:none" target="_blank"><span style="font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration-line:underline;vertical-align:baseline;white-space:pre-wrap">https://reviews.llvm.org/D75268</span></a><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">) even though there has been some uncertainty about the available headroom (</span><a href="https://reviews.llvm.org/D72225#1818149" style="text-decoration-line:none" target="_blank"><span style="font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration-line:underline;vertical-align:baseline;white-space:pre-wrap">https://reviews.llvm.org/D72225#1818149</span></a><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">).</span></p><br><h1 dir="ltr" 
style="line-height:1.38;margin-top:20pt;margin-bottom:6pt"><span style="font-size:20pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">JCC Mitigation in Propeller</span></h1><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Propeller modifies the code layout by emitting basic blocks into sections and reordering them at link time. This means the assembler’s mitigation could be corrupted by Propeller.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">There are two ways in which Propeller can solve the problem:</span></p><br><ol style="margin-top:0px;margin-bottom:0px"><li dir="ltr" style="list-style-type:decimal;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">Redo the full mitigation in the linker</span></p></li><li dir="ltr" style="list-style-type:decimal;font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span 
style="font-size:11pt;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline">Reuse the mitigation that is being implemented in the assembler</span></p></li></ol><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Next we discuss each of the two strategies in more detail.</span></p><h2 dir="ltr" style="line-height:1.38;margin-top:18pt;margin-bottom:6pt"><span style="font-size:16pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Full Mitigation in the Linker</span></h2><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">The current compiler solution is implemented in the assembler backend and its scope is limited to one function at a time (with </span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">-function-sections</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">), which requires excessive alignment of 32B for the function entry.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span 
style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">As a post-link optimization infrastructure, Propeller has a global view of all sections at link time and is in a better position to apply a globally optimal JCC mitigation. The challenge for Propeller is finding the affected branch instructions and inserting padding or prefixes at the right places (some instructions cannot be preceded by prefixes or NOPs). This is easier for the assembler, as it has higher-level information about instructions and can use the MC layer structures (such as MCRelaxableFragment) to emit variable-sized padding or prefixes. </span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><br></span></p><p style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><font color="#000000" face="Arial"><span style="font-size:14.6667px;white-space:pre-wrap">As we discuss next, our prototype relying on the assembler's mitigation incurs no significant overhead, and therefore we do not plan to address this problem in the linker.</span></font></p><h2 dir="ltr" style="line-height:1.38;margin-top:18pt;margin-bottom:6pt"><span style="font-size:16pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Relying on the Assembler’s Mitigation</span></h2><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span 
style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Propeller can use the assembler’s mitigation on every basic block section. However, this means every basic block would be aligned to 32 bytes. The padding between basic blocks may then consist of executed NOPs, which would put significant pressure on the CPU's frontend.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">To reduce the NOP padding, we need to emit BB sections at a coarser granularity, that is, emit multiple basic blocks in the same section. However, Propeller currently delays the basic block layout computation until link time, and hence the actual group of basic blocks (cluster) is only available at link time.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><br></span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">To make this work, we implemented a prototype by moving the layout computation before the final round of Propeller compilation. 
After the layout is computed, basic block partitions of each function are extracted and passed to the compiler.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">For example, consider the following BB layout for a program consisting of two functions foo (with 5 basic blocks) and bar (with a single basic block).</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(244,204,204);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(244,204,204);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo.BB.1</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(244,204,204);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo.BB.2</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(147,196,125);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">bar</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span 
style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(207,226,243);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo.BB.3</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(207,226,243);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo.BB.4</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">The extracted BB partitions are as follows: </span></p><p dir="ltr" style="line-height:1.38;text-indent:36pt;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo: {  [</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(244,204,204);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo, foo.BB.1, foo.BB.2</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">] , [</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(207,226,243);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo.BB.3, foo.BB.4</span><span 
style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">] }</span></p><p dir="ltr" style="line-height:1.38;text-indent:36pt;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">bar: { [</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:rgb(182,215,168);font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">bar</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">] }</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">We instruct the compiler to emit </span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">’s basic blocks in two sections and </span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">bar</span><span 
style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">’s single basic block in one section. The assembler applies JCC mitigation to each of the three sections by aligning them to 32 bytes and inserting minimal padding between instructions within every section. The only change compared to the baseline mitigation with</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> -function-sections</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> is the extra 32-byte alignment emitted for </span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">foo.BB.3</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">. However, the padding this introduces is never executed (though it may put slight pressure on the instruction cache and TLB).</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">We note that the layout algorithm scatters a function’s basic blocks across multiple partitions judiciously, and only when doing so benefits performance. 
For intra-procedural layout, only two clusters are created (hot and cold). Nonetheless, the non-executed paddings for clusters will have minimal impact on performance.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">On another note, better code layout could reduce the overhead of JCC mitigation because the hot code would be packed together and the paddings for the cold blocks will not affect the hot code.</span></p><h2 dir="ltr" style="line-height:1.38;margin-top:18pt;margin-bottom:6pt"><span style="font-size:16pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Results</span></h2><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">We evaluated Clang’s performance under different optimizations with and without JCC mitigation. We used PGO + ThinLTO for all configurations. We tested two propeller code layouts: inter-procedural, and intra-procedural. 
The intra-procedural layout results in at most two clusters for every function, while the inter-procedural layout could lead to more.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">To enable JCC mitigation, we use “-Wl,-mllvm,--x86-branches-within-32B-boundaries -mbranches-within-32B-boundaries”.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">We ran the clang bootstrap test 10 times for each configuration and measured the average CPU time (user + sys, in seconds).</span></p><br><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">We note that our evaluation was performed on a machine </span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:700;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">without</span><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> the microcode update installed.</span></span></div><div style="text-align:justify"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><br></span></div><div style="text-align:justify"><table cellspacing="0" cellpadding="0" dir="ltr" 
border="1" style="table-layout:fixed;font-size:10pt;font-family:Arial;width:0px;border-collapse:collapse;border:none"><colgroup><col width="160"><col width="127"><col width="121"></colgroup><tbody><tr style="height:21px"><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;border:1px solid rgb(204,204,204)"></td><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;border:1px solid rgb(204,204,204)">Mitigation Enabled</td><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;border:1px solid rgb(204,204,204)">Mitigation Disabled</td></tr><tr style="height:21px"><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;border:1px solid rgb(204,204,204)">baseline (PGO + ThinLTO)</td><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;font-size:11pt;font-weight:bold;color:rgb(0,0,0);text-align:right;border:1px solid rgb(204,204,204)">545.362</td><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;font-size:11pt;font-weight:bold;color:rgb(0,0,0);text-align:right;border:1px solid rgb(204,204,204)">542.012</td></tr><tr style="height:21px"><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;border:1px solid rgb(204,204,204)">Propeller intra-procedural</td><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;font-size:11pt;font-weight:bold;color:rgb(0,0,0);text-align:right;border:1px solid rgb(204,204,204)">506.828</td><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;font-size:11pt;font-weight:bold;color:rgb(0,0,0);text-align:right;border:1px solid rgb(204,204,204)">504.861</td></tr><tr style="height:21px"><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;border:1px solid rgb(204,204,204)">Propeller inter-procedural</td><td style="overflow:hidden;padding:2px 3px;vertical-align:bottom;font-size:11pt;font-weight:bold;color:rgb(0,0,0);text-align:right;border:1px solid rgb(204,204,204)">503.23</td><td style="overflow:hidden;padding:2px 
3px;vertical-align:bottom;font-size:11pt;font-weight:bold;color:rgb(0,0,0);text-align:right;border:1px solid rgb(204,204,204)">502.136</td></tr></tbody></table>Clang's CPU time (user + sys, in seconds) for different optimization flavors, with and without JCC mitigation<br></div><div style="text-align:justify"><br></div><div style="text-align:justify"><span id="gmail-m_387428305123375892gmail-docs-internal-guid-409f6a6f-7fff-a14f-7f82-0d03fe038d89"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">First, JCC mitigation results in a 0.6% slowdown when applied to the baseline. With Propeller, JCC mitigation incurs a 0.4% slowdown for the intra-procedural layout and 0.2% for the inter-procedural layout. The smaller JCC mitigation slowdowns for the Propeller configurations show the impact of better code layout. When hot and cold code are mixed together, the padding in the cold part could put more pressure on the I-Cache and I-TLB.</span></span><br></div><div style="text-align:justify"><br></div><div style="text-align:justify"><span id="gmail-m_387428305123375892gmail-docs-internal-guid-09c8950e-7fff-fd6a-2f0a-fdac52d3a750"><h2 dir="ltr" style="line-height:1.38;margin-top:18pt;margin-bottom:6pt"><span style="font-size:16pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-weight:400;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Conclusion</span></h2><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Using BB clusters, we can reuse the assembler’s JCC mitigation with no significant impact on performance. 
In fact, the slowdown caused by JCC mitigation is lower for Propeller, because of the better code layout.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><br></span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-size:11pt;font-family:Arial;color:rgb(0,0,0);background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Finally, we would like to stress once again that Propeller has the potential to do a better job on problems like this JCC mitigation. However, for this particular problem, we have shown that the assembler's mitigation is good enough to be used along with Propeller.<br></span></p><br></span></div></span></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 2, 2020 at 11:56 PM Mehdi AMINI via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Feb 27, 2020 at 6:34 PM Fangrui Song via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I met with the Propeller team today (we work for the same company but it<br>

was my first time meeting two members on the team:) ).<br>

One thing I have been reassured of:<br>

<br>

* There is no general disassembly work. General<br>

disassembly work would assuredly frighten off developers.  (Inherently<br>

unreliable, memory usage heavy and difficult to deal with CFI, debug<br>

information, etc)<br>

<br>

Minimal amount of plumbing work (<a href="https://reviews.llvm.org/D68065" rel="noreferrer" target="_blank">https://reviews.llvm.org/D68065</a>) is<br>

acceptable: locating the jump relocation, detecting the jump type,<br>

inverting the direction of a jump, and deleting trailing bytes of an<br>

input section</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">. The existing linker relaxation schemes already do similar<br>

things. Deleting a trailing jump is similar to RISC-V where sections can<br>

shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in<br>

my mind)) (binutils supports deleting bytes for a few other<br>

architectures, e.g.  msp430, sh, mips, ft32, rl78).  With just a minimal<br>

amount of disassembly work, conceptually the framework should not be too<br>

hard to port to another target.<br>
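
As a rough illustration of why "inverting the direction of a jump" needs only minimal plumbing on x86: the short conditional jumps occupy opcodes 0x70 to 0x7F, and each condition and its negation differ only in the lowest opcode bit (e.g. JE = 0x74, JNE = 0x75). The sketch below assumes a short (rel8) Jcc; the function name is hypothetical and this is not the code from D68065.<br>

```python
# Illustrative sketch only; the helper name is hypothetical and this is
# not taken from lld or D68065.

def invert_short_jcc(opcode: int) -> int:
    """Return the opcode of the opposite condition for a short x86 Jcc.

    Short conditional jumps use opcodes 0x70..0x7F, and flipping the
    lowest bit negates the condition (JE 0x74 <-> JNE 0x75, and so on).
    """
    if not 0x70 <= opcode <= 0x7F:
        raise ValueError(f"not a short Jcc opcode: {opcode:#x}")
    return opcode ^ 1


if __name__ == "__main__":
    # JE becomes JNE, JL becomes JGE.
    print(hex(invert_short_jcc(0x74)), hex(invert_short_jcc(0x7C)))
```

The relocation on the jump tells the linker where the displacement lives, so the opcode byte can be located without disassembling the rest of the section.<br>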

<br>

One thing I was not aware of (perhaps the description did not make it clear) is that<br>

Propeller intends to **reorder basic block sections across translation units**.<br>

This is something that full LTO can do while ThinLTO cannot.<br>

Our internal systems cannot afford doing a full LTO (**Can we fix the bottleneck of full LTO** [1]?)<br>

for large executables and I believe some other users are in the same camp.<br></blockquote><div><br></div><div>Right, beyond distributed build systems, even on a single machine and for "small" projects like clang: building on a laptop with FullLTO can be challenging in terms of memory consumption, and iterative development is just not practical.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Now, with ThinLTO, the post link optimization scheme will inevitably require<br>

help from the linker/compiler. It seems we have two routes:<br>

<br>

## Route 1: Current Propeller framework<br>

<br>

lld does whole-program reordering of basic block sections.  We can extend it in<br>

the future to overalign some sections and pad gaps with NOPs.  What else can we<br>

do? Source code/IR/MCInst is lost at this stage. Without general disassembly<br>

work, it may be difficult to do more optimization.<br>

<br>

This makes me concerned about another thing: Intel's Jump Conditional Code Erratum.<br>

<a href="https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf" rel="noreferrer" target="_blank">https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf</a><br>

<br>

Put in the simplest way, a Jcc instruction whose address ≡ 30 or 31<br>

(mod 32) should be avoided.  There are assembler level (MC) mitigations<br>

(function sections are overaligned to 32), but because we use basic<br>

block sections (sh_addralign<32) and need reordering, we have to redo<br>

some work at the linking stage.<br>
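
To make the rule concrete: the erratum concerns jump instructions that cross, or end on, a 32-byte boundary. For a 2-byte short Jcc this is exactly the "address ≡ 30 or 31 (mod 32)" case. The sketch below is only an illustration with hypothetical helper names; it assumes the instruction is shorter than 32 bytes and uses the simplest possible fix (pad up to the next boundary), which is not necessarily what the assembler does.<br>

```python
# Illustrative sketch only; helper names are hypothetical and this is
# not the assembler's (MC) implementation.

BOUNDARY = 32  # bytes


def touches_boundary(addr: int, size: int, boundary: int = BOUNDARY) -> bool:
    """True if an instruction at `addr` of `size` bytes crosses or ends
    on a `boundary`-byte boundary."""
    # Compare the block holding the first byte with the block holding
    # the byte just past the end of the instruction.
    return addr // boundary != (addr + size) // boundary


def padding_needed(addr: int, size: int, boundary: int = BOUNDARY) -> int:
    """Bytes of padding to place before the instruction so that it
    starts at the next boundary; 0 if it is already fine.
    Assumes size < boundary."""
    if not touches_boundary(addr, size, boundary):
        return 0
    return boundary - addr % boundary


if __name__ == "__main__":
    bad = [a for a in range(BOUNDARY) if touches_boundary(a, 2)]
    print(bad)  # offsets within a 32-byte block that a 2-byte Jcc must avoid
```

Once basic block sections are reordered by the linker, these offsets change, which is why an alignment below 32 on those sections invalidates the padding chosen by the assembler.<br>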

<br>

After losing the representation of MCInst, it is not clear to me how we can<br>

insert NOPs/segment override prefixes without doing disassembly work in the linker.<br>

<br>

Route 2 does the heavy lifting in the compiler, which can naturally reuse the assembler-level mitigation,<br>

CFI and debug information generation, and probably other stuff.<br>

(How will debug information be bloated?)<br>

<br>

## Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.<br>

<br>

Regular ThinLTO with minimized bitcode files:<br>

<br>

        all: compile thin_link thinlto_backend final_link<br>

<br>

        compile a.o b.o a.indexing.o b.indexing.o: a.c b.c<br>

                $(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c<br>

                $(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c<br>

<br>

        thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o<br>

                $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace=';lto' -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o b.indexing.o<br>

<br>

        thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc<br>

                $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o<br>

                $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o<br>

<br>

        final_link exe: lto/a.o lto/b.o a.rsp<br>

                # Propeller does basic block section reordering here.<br>

                $(clang) -fuse-ld=lld @a.rsp -o exe<br>

<br>

We need to replace the two stages thinlto_backend and final_link with<br>

three.<br>

<br>

Propelled ThinLTO with minimized bitcode files:<br>

<br>

        propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc<br>

                # Propeller emits something similar to a Machine IR file.<br>

                # a.o and b.o are all IR files.<br>

                $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir<br>

                $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir<br>

<br>

        propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir<br>

                # Propeller collects input Machine IR files<br>

                # and spawns threads to generate object files in parallel.<br>

                $(clang) -fpropeller-backend -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir<br>

<br>

        final_link exe: propeller/a.o propeller/b.o<br>

                # GNU ld/gold/lld links object files.<br>

                $(clang) $^ -o exe<br></blockquote><div><br></div><div>There was an interesting talk last week at the LLVM performance workshop: <a href="https://llvm.org/devmtg/2020-02-23/#kl" target="_blank">Global Machine Outliner for ThinLTO</a> which introduced a similar stage in ThinLTO (for another purpose though). I believe they avoid the serialization of MIR by running the CodeGen twice instead (once to collect the cross-module information, and a second time using this information).</div><div>CC the author in case the slides are already available online.<br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

A .mir may be much larger than an object file. So lto/a.mir may be<br>

actually an object file annotated with some information, or some lower<br>

level representation than a Machine IR (there should be a guarantee that<br>

the produced object file will keep the basic block structure unchanged<br>

=> otherwise basic block profiling information will not be too useful).<br>

<br>

<br>

<br>

[1]: **Can we fix the bottleneck of full LTO** [1]?<br>

<br>

I wonder whether we have reached a "local maximum" of ThinLTO.<br>

If full LTO were nearly as fast as ThinLTO, how would we design a post-link optimization framework?<br>

Apparently, if full LTO did not have the scalability problem, we would<br>

not do so much work in the linker?<br></blockquote><div></div><div><br></div><div>A lot of work went into ThinLTO because the scalability issue of LTO was considered inherent to the design. It isn't clear what you're suggesting here though?</div><div><br></div><div>-- </div><div>Mehdi</div><div><br></div><div> </div></div></div></div></div>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a></blockquote></div></div>