<div dir="ltr">The generic-abi thread has gone into broader subjects of the benefits and desirability of the work. I'm willing to take it as given that reducing the encoded size of pure-relative address relocs (i.e. R_*_RELATIVE equivalents)--ultimately the RODATA segment size of a given ET_DYN file--as the sole metric is a worthy goal and the ballpark savings ratios we're seeing are worth committing to a new ABI. But I am circumspect about choosing an encoding we will be supporting for decades to come. Whatever we do now will surely be good enough that nobody will want to innovate again for many years just to get to a little or a fair bit better. It behooves us to be deliberate in getting it as good as we reasonably can get it now for the broad range of ET_DYN files that will appear in years to come.<div><br></div><div>I tend to share the intuitions people have expressed about what kinds of patterns of offsets are likely. I also have a contrary intuition that there are large codebases with lots of formulaic or generated code and data tables that may well have truly enormous numbers of such relocs that fit highly regular patterns just slightly different from the ones we're considering most likely.</div><div><br></div><div>Moreover I don't think there is any excuse for relying on our intuition when there are vast quantities of actual data pretty readily available. I don't mean picking a few "important" real-world binaries and using their real data. Examples like Chrome and Firefox have already been tuned by sophisticated developers to minimize relocation overhead and may well not be representative of other programs of similar size and complexity. I mean collecting data from a large and varied corpus of ET_DYN files across machines, operating systems, and codebases.</div><div><br></div><div>A pretty simple analysis tool can extract from any ET_DYN file the essential statistics (byte sizes of segments and relocs) and the list of R_*_RELATIVE reloc r_offset values. 
(I started writing one in Python and I can finish it if people want to use it.) It's easy enough to feed that with lots of ET_DYN files available in public collections such as Linux distros. The tool is simple to run and the data extracted is not really revealing (beyond rough binary size), so it can be applied to lots of proprietary sets of ET_DYN files and the data contributed to the public corpus, from places like Google's internal binaries, Android system binaries, ET_DYN files in APKs in app stores, etc. across many companies.</div><div><br></div><div>Given this corpus of "reloc traces" you can code up many competing encoding formats and do serious measurements of their space savings across the entire corpus from simple simulations without having to implement each encoding in an actual toolchain and dynamic linker to do the analysis. This is some work, but I don't think it's a huge undertaking in comparison to the actual full deployment roll-out of a new ELF dynamic linking ABI feature and its impact, which always wind up being much more than just the actual new code in toolchain and runtime implementations. I think the work all over that will ripple out from the deployment, and the many-year commitment to the new format that will inevitably be incurred, merit a rigorous and public data-driven approach.</div><div><br></div></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><p dir="ltr"><br>
Thanks,<br>
Roland</p>
</div>