<div dir="ltr"><div dir="ltr"></div><div class="gmail_quote"><div>I respond with regards to "unwind tables" and "code layout".</div><div><br></div><div>WIth regards to "unwind tables", the idea of splitting call-sites tables has been implemented in D73739.<br></div><div>I think "asynchronous unwind tables" are already supported by our patch (Meaning that the unwind tables are precise at instruction boundary).</div><div>Please let us know if you believe otherwise.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">* An algorithm to the extended travelling salesman problem looks fancy.<br>

   It also introduced a bunch of parameters. Have other ordering approaches been considered?<br></blockquote><div><br></div><div>The Extended-TSP algorithm is published in <a href="https://arxiv.org/abs/1809.04676" target="_blank">https://arxiv.org/abs/1809.04676</a> and shows a speedup improvement of about 0.7% compared to TSP (on Clang), which I confirmed independently.</div><div>The general idea of considering jump distance in the layout decision is ubiquitous (<a href="https://dl.acm.org/doi/10.1145/3302516.3307358" target="_blank">https://dl.acm.org/doi/10.1145/3302516.3307358</a> (codestitcher), <a href="https://dl.acm.org/doi/10.5555/3049832.3049858" target="_blank">https://dl.acm.org/doi/10.5555/3049832.3049858</a> (hfsort) ). </div><div>We have considered other ordering approaches such as pettis-hansen and codestitcher (both of which are  based on TSP) and we have found Ext-TSP to be the best.</div><div>You can also refer to the codestitcher and ext-TSP paper for more extensive study.</div><div><br></div><div>Let me point out that although the formulation of ext-TSP is more involved than TSP, the heuristic algorithm that is used to solve it is essentially the same as the one for TSP. It's an incremental changing algorithm aimed at maximizing the gain in the score at each step.</div><div><br></div><div>On the other hand, most of the complication in the algorithm is caused by split-and-remerge (guided by -propeller-chain-split-threshold) which is highly beneficial for escaping local optima, resulting in 2% more performance improvement. The original ext-TSP work sets the split threshold at 128 bytes, while we can achieve better speedup by setting it to 1024bytes, without sacrificing the complication time (The whole reordering algorithm takes about 4 seconds on clang).</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

   I am concerned that the raw ELF fields (like sh_size) are used for<br>

   heuristics. It seems to try reconstructing some information about<br>

   Machine{Function,BasicBlock,Instr,...} about doing that in the<br>

   Machine* stage may just be simpler and more accurate. </blockquote><div><br></div><div>Yes. Basic block (binary) sizes are used to compute the binary distance for jumps in the final layout (as used by the ext-TSP algorithm). We can currently get this information both from the object code (sections), or the BB symbol sizes (propagated through the propeller profile). Getting this information in the machine stage is not as accurate and convenient. For instance, StackMapShadowTracker has been implemented in llvm to count the instruction bytes after the last StackMap and it shares some code between different classes (X86AsmPrinter and X86MCInstLower).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

We feel that the available optimizations may be limited. The lld focused<br>

implementation may have some improvement in the near term but the<br>

benefits will dilute quickly when the Machine{Function,BasicBlock}<br>

interfaces get improved.<br>

<br>

Also, doing heavy-lifting work on the linker side:<br>

<br>

Needs non-trivial work to port to another binary format.  Another target<br>

may be a very different story. (Even for x86-64, I see the patches<br>

attend to address R_X86_64_PC8/R_X86_64_PC32/R_X86_64_PLT32 calls, but<br>

the two relocation types are not handled.)<br>

The newly lld/ELF code overlaps with jobs of MachineBlockPlacement/BranchFolding/AsmPrinter/etc.<br>

<br>

If the current MachineFunction interface is not flexible (e.g. it does<br>

not allow sharable MachineBasicBlocks between MachineFunctions.) enough,<br>

let's fix it.<br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

</blockquote></div></div>