[PATCH] D148045: [NFC][Symbolizer] Refactor out ProcessContext

Tue May 2 13:04:43 PDT 2023

mcgrathr added inline comments.

================
Comment at: llvm/include/llvm/DebugInfo/Symbolize/ProcessContext.h:50
+    std::string Mode; // Lowercase
+    uint64_t ModuleRelativeAddr;
+
----------------
mysterymath wrote:
> mcgrathr wrote:
> > I'm not sure this is the best name for this, though maybe it is. If this is meant to be just the parsed integer from the last field in the mmap markup element, then I think it's not the best name for that. To me, "module-relative" is ambiguous between "in the local terms of the module" and "relative to the low address of the module". I usually try to speak about "runtime address" and "link-time address", which are what the two congruent addresses in an mmap element are.
> > 
> > In many modules, "relative to the load address of the module" is the same as "in the terms of the module", because the low address of the module is zero. (By low address I mean the p_vaddr of the first PT_LOAD, truncated to runtime page size, which should be the same as truncated to its p_align.)  This is most commonly zero nowadays, but it is not always. In an ET_EXEC file, it's rarely zero. But even though the prelink tool is no longer used (glibc removed the support for its special dynamic linking optimization semantics), it remains valid for an ET_DYN file to be statically linked with a nonzero low address (that serves as its preferred load address at runtime, but doesn't constrain its actual load address chosen by the system at runtime, like an ET_EXEC file, does).  For example, today some Fuchsia kernels are linked like ET_EXEC files with nonzero link-time addresses (these aren't ELF files at all in the runtime memory image, but they still have build IDs and report mmap elements corresponding to PT_LOAD segments in a debugging ELF file) and other Fuchsia kernels are vanilla ET_DYN files linked at zero--but both have runtime addresses almost always different from their link-time addresses by a runtime load bias computable as the difference between the two fields in the mmap element they report.
> > 
> > So I'd be inclined to have `RuntimeAddr, LinktimeAddr, Size, Prot` as the members here, with thorough comments about the semantics of each. It can cross-reference with the markup spec but should explain everything independent of it here IMHO.
> > I'm not sure this is the best name for this, though maybe it is. If this is meant to be just the parsed integer from the last field in the mmap markup element, then I think it's not the best name for that. To me, "module-relative" is ambiguous between "in the local terms of the module" and "relative to the low address of the module". I usually try to speak about "runtime address" and "link-time address", which are what the two congruent addresses in an mmap element are.
> > 
> > In many modules, "relative to the load address of the module" is the same as "in the terms of the module", because the low address of the module is zero. (By low address I mean the p_vaddr of the first PT_LOAD, truncated to runtime page size, which should be the same as truncated to its p_align.)  This is most commonly zero nowadays, but it is not always. In an ET_EXEC file, it's rarely zero. But even though the prelink tool is no longer used (glibc removed the support for its special dynamic linking optimization semantics), it remains valid for an ET_DYN file to be statically linked with a nonzero low address (that serves as its preferred load address at runtime, but doesn't constrain its actual load address chosen by the system at runtime, like an ET_EXEC file, does).  For example, today some Fuchsia kernels are linked like ET_EXEC files with nonzero link-time addresses (these aren't ELF files at all in the runtime memory image, but they still have build IDs and report mmap elements corresponding to PT_LOAD segments in a debugging ELF file) and other Fuchsia kernels are vanilla ET_DYN files linked at zero--but both have runtime addresses almost always different from their link-time addresses by a runtime load bias computable as the difference between the two fields in the mmap element they report.
> > 
> > So I'd be inclined to have `RuntimeAddr, LinktimeAddr, Size, Prot` as the members here, with thorough comments about the semantics of each. It can cross-reference with the markup spec but should explain everything independent of it here IMHO.
> I do like `RuntimeAddr` and `LinktimeAddr`, and uses of unadorned `Addr` should definitely be made into `RuntimeAddr` when appropriate. I ended up doing a pretty significant mental detour due to the term "module-relative address" when writing these changes; I had assumed incorrectly that these were the LMAs of the memory regions. "Module relative address" is a phrase cribbed directly from the symbolizer markup reference doc though, and it seems strange to have them out of sync. How would you feel about renaming the relevant terms of the `mmap` element to `runtime address` and `link-time address`?
> tof the expected 
I think we should certainly improve the wording and terminology choices of the spec whenever it improves clarity, and striving for consistent terminology across docs and code seems like a good thing for clarity.

================
Comment at: llvm/lib/DebugInfo/Symbolize/ProcessContext.cpp:53
+        JOS.attribute("type", "elf");
+        JOS.attribute("buildID", toHex(Module->BuildID, /*LowerCase=*/true));
+        JOS.objectEnd();
----------------
mysterymath wrote:
> mcgrathr wrote:
> > I'm not partial to mixed-case for JSON keys in the schema.
> > I know I'm coming in late since this is just refactoring the generation code and the JSON schema details are outside the scope of this change per se. But I imagine it's all still new enough and in flux such that we can refine it freely.
> > 
> > I'd use "build_id" here, but "buildid" would also be fine if underscores are undesirable in some quarters.
> > 
> I did a quick look around the larger JSON-verse to try to see which way the wind was blowing, and it looks like camelCase is the de-facto standard due to JS's influence.
> It's the official recommendation of a number of JSON standards (e.g., Google's JSON Style Guide, schema.org, JSON-API), while I couldn't find anything mandating snake, kebab, or Pascal case
> I wasn't particularly scientific about this though, and IIRC, I've observed all four conventions in the LLVM codebase.
Finding one style and staying consistent with it across tools and schemas seems like the most important thing, along with actually documenting/formalizing the schemas. I'm not familiar with what all places in LLVM have invented JSON schemas. This one and the readelf/readobj output schemas are two relatively recent ones that I think are still subject to change, so being consistent in style between at least those two seems like a good starting point.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148045/new/

https://reviews.llvm.org/D148045