[PATCH] D144009: [obj2yaml] Save offset for segments and size for PHDR

Fri Feb 17 00:38:35 PST 2023

jhenderson added a comment.

In D144009#4131263 <https://reviews.llvm.org/D144009#4131263>, @treapster wrote:

> In D144009#4131236 <https://reviews.llvm.org/D144009#4131236>, @jhenderson wrote:
>
>> Could you clarify why you only emit the sizes for the PHDR type? It's perfectly legitimate for other segment types to have a size that covers other areas too that aren't parts of sections (and therefore the size can't be derived purely from the section size), but IIRC, yaml2obj will assume the sizes match unless told otherwise. I suppose it might be necessary to emit Fill "sections" in the YAML to cover the gap in some cases to ensure section layout works correctly.
>
> I don't know of any practical examples of segments aside from PHDR and first load which cover out-of-section space, and with these two cases, the size will be properly adjusted by yaml2obj to cover everything from the provided offset to the last included section, or we'll provide it for PHDR in absence of sections.

FWIW, I know of situations within an internal toolchain where this can happen, although we don't use the LLVM tools in that situation, I believe. Also, if explicitly instructed to remove a section that is within a segment, llvm-objcopy will preserve the gap in the segment, to avoid invalidating addresses, so it could arise from that situation. Finally, since yaml2obj supports the concept of "Fill" sections (sections that aren't actually sections in the section header table, i.e. they just cover blocks of empty space), it's possible to create an object with gaps in that way, without explicitly specifying the offset/size of program headers. obj2yaml doesn't currently support emitting Fill sections, but it could.

When we did some work a few years ago to yaml2obj to allow it to create "Fill" sections, we discussed creating Chunks for the ELF Header and Program Header Tables (like we do already with the Sections and SectionHeaderTable), with the idea that you could list them in the FirstSec/LastSec listing of a segment. I feel like this would be a better way of ensuring the size of a segment (especially a PT_PHDR segment) is correct, rather than hard-coding the size. Hard-coding the size means it will fall down if somebody decides to manually add an additional program header to the YAML produced by obj2yaml, for example.

================
Comment at: llvm/tools/obj2yaml/elf2yaml.cpp:488-493
+    PH.Offset = Phdr.p_offset;
+
+    if (Phdr.p_type == ELF::PT_PHDR) {
+      PH.MemSize = Phdr.p_memsz;
+      PH.FileSize = Phdr.p_filesz;
+    }
----------------
treapster wrote:
> jhenderson wrote:
> > I think we should be emitting these values in the output only if they can't be derived automatically from the existing layout, otherwise the YAML output will end up unnecessarily verbose. It's been a while since I've looked at the obj2yaml code though, so it's possible this happens elsewhere.
> I don't think one line per segment is so verbose that we should build separate logic to determine whether to include it. What makes yaml verbose are symbols, relocations, dynamic, note sections and etc. TBH i think we should also include section offsets unconditionally because people strip sections all the time which breaks all this smart offset derivation and rebuilt binaries have absolutely nonsensical offset/address values, which makes it difficult to make some tools(e.g BOLT) more strict to the input binaries, since all the generated tests will fail.  
> But saving section offsets in all cases requires changing about 40 tests(that much failed when i just started always including section offsets), so unfortunately i won't do it anytime soon.
> I don't think one line per segment is so verbose that we should build separate logic to determine whether to include it.

My reasoning here is that a significant use-case I'm aware of is for people creating test cases that use yaml2obj to generate the test input at runtime. Essentially, they get an object they want to test using the compiler/assembler/etc and then use obj2yaml to generate the YAML for use in the test. If the fields are emitted unnecessarily this causes two problems: 1) the YAML becomes more verbose than necessary, obfuscating what is actually important in the test and 2) it makes it harder to manipulate the YAML to add/remove sections from it (because the offsets etc all become invalid and need updating by hand). Thinking about it, I'm more concerned by the second of these.

> people strip sections all the time which breaks all this smart offset derivation and rebuilt binaries have absolutely nonsensical offset/address values

Sorry, I'm not sure what you're saying here? Are you saying people use tools like llvm-strip/llvm-objcopy to manipulate sections? If so, the offsets and addresess of any remaining sections and segments should remain valid afterwards (and if they don't, that's a bug in llvm-strip/llvm-objcopy etc). Alternatively, are you saying people hand-edit the YAML after obj2yaml emitted it? If it's the latter, emitting more offsets can cause other issues.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144009/new/

https://reviews.llvm.org/D144009