<div>Thanks for the useful information. We notice that the idea behind LIPO can also help LLVM LTO if LLVM has FDO/PGO. As for Diablo, we'll study it, and I think we'll get some good ideas from it.</div>
<div><br></div><div>In MCLinker, the details of the instructions and data in bitcode are still kept during linking, so some opportunities to optimize the instructions in bitcode become straightforward. Instruction relaxation is one such case. (Since ARM is one of the targets we focus on, I'm going to use ARM to illustrate the problem.)</div>
<div><br></div><div>When linking bitcode with other object files, stubs are necessary if the branch target is out of range or the branch requires ARM/THUMB mode switching. The Google gold linker basically uses two kinds of stubs. One is a sequence of consecutive branch instructions, and the other is one branch instruction followed by one instruction (e.g., ldr) that changes the PC directly.</div>
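<div><br></div><div>As a rough illustration of the two conditions above, here is a minimal Python sketch that decides whether an ARM-mode bl needs a stub. The function name and parameters are hypothetical and are not taken from MCLinker or gold. It assumes the ARM-mode encoding, where the branch offset is a signed 24-bit word count relative to the instruction address plus 8, and it conservatively treats every ARM/THUMB mode change as needing a stub (a linker targeting ARMv5T or later could instead rewrite bl to blx for an in-range THUMB target, which this sketch ignores).</div>

```python
# Hypothetical helper for illustration only; not MCLinker or gold API.
ARM_PC_BIAS = 8                 # in ARM mode, PC reads as instruction address + 8
ARM_BL_MIN = -(1 << 25)         # signed 24-bit immediate, shifted left by 2
ARM_BL_MAX = (1 << 25) - 4


def needs_stub(branch_addr, target_addr, target_is_thumb=False):
    """Return True if an ARM-mode bl at branch_addr cannot reach target_addr directly."""
    # Assumption: any ARM -> THUMB transition goes through a stub.
    if target_is_thumb:
        return True
    offset = target_addr - (branch_addr + ARM_PC_BIAS)
    if offset % 4 != 0:
        raise ValueError("ARM branch targets must be word aligned")
    # A stub is needed when the offset does not fit the bl immediate.
    return not (ARM_BL_MIN <= offset <= ARM_BL_MAX)
```

<div>For example, needs_stub(0x8000, 0x8100) is False because the target is well within range, while the same call with target_is_thumb=True returns True.</div>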
<div><br></div><div>An example of the latter case:</div><div><br></div><div>1: bl &lt;stub_address&gt;</div><div>...</div><div>2: ldr pc, [pc, #-4] ; stub</div><div>3: dcd R_ARM_ABS32(X)</div><div><br></div><div>In MCLinker, we can optimize it as follows:</div>
<div> </div><div>X: ldr ip, [pc, #-4]</div><div>Y: dcd R_ARM_ABS32(X)</div><div>Z: bx ip</div><div><br></div><div>Before the optimization, some processors suffer from flushing the ROB/Q because their pipelines are filled with the invalid instructions that appear immediately after the ldr. None of these instructions should be executed, so the processor must flush them when the ldr is committed.</div>
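<div><br></div><div>To make the stub layout in the first example concrete (ldr pc, [pc, #-4] followed by one data word), here is a small Python sketch of the address arithmetic and the resulting bytes. The helper names are hypothetical, and the sketch assumes a little-endian target and that the R_ARM_ABS32 relocation has already been resolved to an absolute address. The constant 0xE51FF004 is the ARM encoding of ldr pc, [pc, #-4].</div>

```python
# Hypothetical helpers for illustration only; not MCLinker or gold API.
LDR_PC_PC_MINUS_4 = 0xE51FF004  # ARM encoding of: ldr pc, [pc, #-4]
ARM_PC_BIAS = 8                 # in ARM mode, PC reads as instruction address + 8


def literal_address(ldr_addr, imm=-4):
    """Address that an ARM-mode ldr rX, [pc, #imm] at ldr_addr loads from."""
    return ldr_addr + ARM_PC_BIAS + imm


def build_stub(target_addr):
    """Bytes of the two-word stub: the ldr instruction, then the target address."""
    # Assumption: little-endian target, relocation already resolved.
    return (LDR_PC_PC_MINUS_4.to_bytes(4, "little")
            + target_addr.to_bytes(4, "little"))
```

<div>With the ldr at address A, literal_address(A) is A + 4, that is, the word directly after the instruction, which is why the dcd has to follow the ldr immediately.</div>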
<div><br></div><div>Since all details of the instructions and data are preserved, MCLinker can rewrite the program directly without inserting a stub. It can replace the 1: bl instruction with a longer branch Z: bx, and the performance of the program is therefore improved by more efficient use of the branch target buffer (BTB).</div>
<div>This is just one case, and there are other optimizations we can do.</div><div><br></div>Thanks,<div>Chinyen</div><div><br></div>
<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">> In GCC, LTO causes 'fat' object files, because GCC needs to serialize<br>
> IR into 'intermediate language' (IL) and compress IL in object files.<br>
> In our experience, the 'fat' object files are x10 bigger than the<br>
> original one, and slow down the linking process significantly. The<br>
> generated code can get about only 7%~13% improvement.<br>
<br>
Right. Though GCC 4.7 will offer an option to emit just bytecode in<br>
object files. Additionally, the biggest gains we generally observe<br>
with LTO is when it's coupled with FDO. And almost always, the<br>
biggest wins are in the inliner<br>
(<a href="http://gcc.gnu.org/wiki/LightweightIpo" target="_blank">http://gcc.gnu.org/wiki/LightweightIpo</a>).<br>
<br>
> Apart from the LTO, we also have some good idea on link time<br>
> optimization. I will open another thread to discuss this later.<br>
<br>
You may want to look at Diablo (<a href="http://diablo.elis.ugent.be/" target="_blank">http://diablo.elis.ugent.be/</a>). An<br>
optimizing linker that has been around for a while. I'm not sure<br>
whether it is still being developed, but they had several interesting<br>
ideas in it.<br>
<br>
<br>
Diego.<br>
</blockquote></div><br>