[PATCH] D148045: [NFC][Symbolizer] Refactor out ProcessContext

Fri Apr 28 21:53:07 PDT 2023

mcgrathr accepted this revision.
mcgrathr added a comment.
This revision is now accepted and ready to land.

This is a pretty pure refactoring, so it probably makes sense to address all my comments separately. But I'm happy for this to land with or without the various cosmetic changes I'm recommending here. While I'm not that particular about all the exact details of the changes, I would prefer that we refine these things in those directions incrementally if not as part of these other refactorings.

================
Comment at: llvm/include/llvm/DebugInfo/Symbolize/MarkupContext.h:37
+/// module-relative addresses.
+class MarkupContext {
+public:
----------------
phosek wrote:
> mysterymath wrote:
> > phosek wrote:
> > > Do you think it'd be more descriptive to name this class `ProcessContext`?
> > I like this a lot. It was a bit strange that llvm-objump referred to markup in the patch at the end of the stack; just calling these process contexts makes markup one way to produce them, but frees up the possibiliy for others.
> That was exactly my thinking, I could imagine producing the context from other sources such as `/proc/<pid>/maps` which is frequently done even in LLVM, for example https://github.com/llvm/llvm-project/blob/fa2a8c2e1cdf188ce985d69fca6f78866390b715/compiler-rt/lib/sanitizer_common/sanitizer_procmaps_linux.cpp or https://github.com/llvm/llvm-project/blob/fa2a8c2e1cdf188ce985d69fca6f78866390b715/lldb/source/Plugins/Process/Utility/LinuxProcMaps.cpp#L27.
> 
> Relatedly, in LLVM, we already have https://github.com/llvm/llvm-project/blob/fa2a8c2e1cdf188ce985d69fca6f78866390b715/compiler-rt/lib/sanitizer_common/sanitizer_procmaps.h#L39, https://github.com/llvm/llvm-project/blob/fa2a8c2e1cdf188ce985d69fca6f78866390b715/lldb/include/lldb/Target/MemoryRegionInfo.h#L21 and https://github.com/llvm/llvm-project/blob/fa2a8c2e1cdf188ce985d69fca6f78866390b715/llvm/include/llvm/ProfileData/MemProfData.inc#L54 which are used for similar purpose. I don't think it's worth trying to unify these at this point, but something to keep in mind for the future.
ProcessContext is a fine enough name. To be pedantic, it's not necessarily a "process" per se, but is an address space context. For example, we use this with kernel code where there are no "processes" involved in the thing being described, but there is of course an address space context for what register values and stored pointers refer to.

As to future generalization, that's certainly a good plan. You might refer to some elfutils tools, which all get this support via libdwfl (part of libdw). There isn't anything like a JSON format, but there is the scriptable `eu-unstrip -n` output format. Tools like eu-unstrip, eu-add2rline, eu-stack use common library code for some common command-line options to pick supported kinds of input, which include a running process, a core file, the running Linux kernel, but also the text format of `/proc/PID/maps` which is what the direct PID switch really uses underneath but usable on text captured earlier without the process still being live. A standardized JSON format that represents everything that all of those flavors support seems like something everyone would want to adopt, and a good way to keep it clear exactly what kinds of context information all the different kinds of input sources have in common (and notice if any have additional information we might want to specify in the schema as optional).

On Fuchsia, gcore produces core files that have NT_GNU_BUILD_ID notes directly in them so that you can get the list of build ID-indexed files and their (implied) load addresses just by reading those. (Then you have to find the vaddr of the build ID in the offline file once you've fetched it, to compute the load bias between the runtime vaddr of the build ID's Elf_NHdr and the static vaddr of those bytes in the file.)   Like ET_CORE files from other systems, they will also have PT_LOAD segments covering the ELF headers and build IDs dumped from memory so you can just scan the dumped memory to reconstruct them that way. Linux by default dumps enough of each ELF file that you likely can parse out the phdrs and build IDs. (Both elfutils and fuchsia/src/lib/zxdump have code you can refer to doing this kind of groveling for ELF images inside memory images.)

In particular parsing from `/prod/PID/maps` (Linux) format is useful both for tools like this, and where Petr pointed to it being done in runtime library code. The latter has very different constraints on how things are done so it's plausible it's better to reimplement that parser in LLVM library code meant for the environment rich with LLVM support libraries than to try to share it with code appropriate for a sanitizer or unwinder runtime.

================
Comment at: llvm/include/llvm/DebugInfo/Symbolize/ProcessContext.h:48
+    uint64_t Size;
+    const Module *Mod;
+    std::string Mode; // Lowercase
----------------
I think it's helpful for the pointer member to be either before or after all the non-pointer members here. The others are all data that's just parsed directly from the mmap element in the markup input case. The pointer seems different.

I think this member needs a comment saying if it can ever be null, and what that means. For example, in markup input you could have mmap elements that refer to a module ID for which no module element appears. It's plausible these can still be useful, since the sequence of mmap elements for the module and their runtime<->link-time address correspondences could still be useful by itself to someone. (Imagine markup in logging text where some lines have been garbled or lost, but you need to do all you can semi-manually to reconstruct the address space layout from candidate ELF images the system might have been running.)

It would make sense to say Mod can be null if the module wasn't found. It could also be reasonable to say that Mod is never null, but that a module ID never found is represented by Module{ID,{},{}}.

================
Comment at: llvm/include/llvm/DebugInfo/Symbolize/ProcessContext.h:49
+    const Module *Mod;
+    std::string Mode; // Lowercase
+    uint64_t ModuleRelativeAddr;
----------------
I think it's helpful to describe in comments here exactly what this string contains and what that means, even if the name "mode" is used and defined elsewhere in the markup spec documentation.  Users of this data structure in future aren't necessarily working with markup format, since there can be other ways to populate this data structure from live processes, core dumps, etc.

Does the comment mean that this has been downcased as normalization from the input string?

If that's what this is, then a simple comment like `// Downcased [r-][w-][x-] string.` can be all it needs.

If the markup format allows e.g. any order of the recognized letters, then it might be useful to normalize this to canonical order, and then the comment can just say "Normalized lowercase ...", etc.

================
Comment at: llvm/include/llvm/DebugInfo/Symbolize/ProcessContext.h:50
+    std::string Mode; // Lowercase
+    uint64_t ModuleRelativeAddr;
+
----------------
I'm not sure this is the best name for this, though maybe it is. If this is meant to be just the parsed integer from the last field in the mmap markup element, then I think it's not the best name for that. To me, "module-relative" is ambiguous between "in the local terms of the module" and "relative to the low address of the module". I usually try to speak about "runtime address" and "link-time address", which are what the two congruent addresses in an mmap element are.

In many modules, "relative to the load address of the module" is the same as "in the terms of the module", because the low address of the module is zero. (By low address I mean the p_vaddr of the first PT_LOAD, truncated to runtime page size, which should be the same as truncated to its p_align.)  This is most commonly zero nowadays, but it is not always. In an ET_EXEC file, it's rarely zero. But even though the prelink tool is no longer used (glibc removed the support for its special dynamic linking optimization semantics), it remains valid for an ET_DYN file to be statically linked with a nonzero low address (that serves as its preferred load address at runtime, but doesn't constrain its actual load address chosen by the system at runtime, like an ET_EXEC file, does).  For example, today some Fuchsia kernels are linked like ET_EXEC files with nonzero link-time addresses (these aren't ELF files at all in the runtime memory image, but they still have build IDs and report mmap elements corresponding to PT_LOAD segments in a debugging ELF file) and other Fuchsia kernels are vanilla ET_DYN files linked at zero--but both have runtime addresses almost always different from their link-time addresses by a runtime load bias computable as the difference between the two fields in the mmap element they report.

So I'd be inclined to have `RuntimeAddr, LinktimeAddr, Size, Prot` as the members here, with thorough comments about the semantics of each. It can cross-reference with the markup spec but should explain everything independent of it here IMHO.

================
Comment at: llvm/include/llvm/DebugInfo/Symbolize/ProcessContext.h:53
+    bool contains(uint64_t Addr) const;
+    uint64_t getModuleRelativeAddr(uint64_t Addr) const;
+  };
----------------
I think it would be worthwhile to use uniform argument names for the different semantic kinds of addresses, e.g. RuntimeAddr everywhere instead of just Addr. Then if we want to have things that refer to LinktimeAddr or ModuleOffset or whatever other terms they can all be clear and well-explained centrally in one comment per canonical name.

(I'm still not sure whether ModuleRelativeAddr here means what I'd call LinktimeAddr or what I'd call ModuleOffset, the latter being 0 at the beginning of the truncated file p_vaddr and the former just being that truncated file p_vaddr.)

================
Comment at: llvm/include/llvm/DebugInfo/Symbolize/ProcessContext.h:74
+  /// existing one.
+  std::pair<const MMap *, bool> insertMMap(MMap MM);
+
----------------
Perhaps this should state some invariant about MM->Mod or it should be passed the ID and itself be responsible for setting the inserted MMap::Mod pointer. That is, unless something in the contract and enforced with assertions or else implicit and so impossible to get wrong requires Mod to be something getModule could return (i.e. that insertModule did return), then it seems quite likely someone will pass  a copied MMap::Mod that points to the copied Module temporary they passed insertModule, rather than the pointer it returned, and hilarity will ensue.

================
Comment at: llvm/lib/DebugInfo/Symbolize/ProcessContext.cpp:53
+        JOS.attribute("type", "elf");
+        JOS.attribute("buildID", toHex(Module->BuildID, /*LowerCase=*/true));
+        JOS.objectEnd();
----------------
I'm not partial to mixed-case for JSON keys in the schema.
I know I'm coming in late since this is just refactoring the generation code and the JSON schema details are outside the scope of this change per se. But I imagine it's all still new enough and in flux such that we can refine it freely.

I'd use "build_id" here, but "buildid" would also be fine if underscores are undesirable in some quarters.

================
Comment at: llvm/lib/DebugInfo/Symbolize/ProcessContext.cpp:60
+        JOS.objectBegin();
+        JOS.attribute("address", Map.Addr);
+        JOS.attribute("size", Map.Size);
----------------
I'd change these names in the schema to match my comments about the data structure above too, as well as avoiding mixed case.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148045/new/

https://reviews.llvm.org/D148045