[PATCH] D148047: [Symbolizer] Parse ProcessContext from JSON

Fri Apr 28 22:55:35 PDT 2023

mcgrathr added a comment.

I'd really like to see a formally written JSON schema for this, ideally in actual JSON schema format as well as human-readable documentation (or perhaps formal JSON schema with enough "description" elements can be fully sufficient). It also seems ideal to have the ProcessContext->JSON modularly separate from the markup parser, and to do some unit tests for each direction of that transformation that use each other to verify they match each other as well as matching their own hand-constructed unit test cases.  I also wonder if it might be worth writing the input test cases as actual JSON raw strings that get parsed by a JSON parser during the test (or I suppose at compile time), to take the human factor out of reviewing how the test-constructed JSON data matches the human's mental model of the JSON.

For the markup parser it's certainly convenient to produce a structure that collects things by element. But I wonder if it might not be a more generally desirable schema to just group as a list of modules with each module containing a list of mmap elements. (This is a separate question from how ProcessContext is structured, but having it also structured in a way that makes it easy to iterate over modules as well as do random-access address->module mapping seems worth considering.) Note that in the general case, it need not be the case that every module has all its mmap regions grouped in an address range where no other module has any mmap regions interleaved into the holes. So if it were part of the schema that the mmap regions are in address order, then that could be a point in favor of a flat mmaps list as you have rather than per-module lists. Since even if a per-module list is internally sorted, a flat and presorted mmaps list can be used for binary search to map addresses without doing any sorting in the input data from JSON. But it's not at all clear here whether sorted mmaps are a requirement of the schema. There's no address ordering requirement for mmap elements in markup format (just IIRC a referential ordering requirement that a module element precede mmap elements using its ID).  But it certainly seems worthwhile for ProcessContext to offer address-ordered region enumeration as well as module enumeration for its likely future uses (e.g. if this library code is one day shared with debuggers and such).

================
Comment at: llvm/include/llvm/DebugInfo/Symbolize/ProcessContext.h:37
+class ProcessContext;
+bool fromJSON(const json::Value &Value, ProcessContext &Result,
+              json::Path Path);
----------------
This makes it very natural that toJSON be here too instead of intermingled with the markup parser.

================
Comment at: llvm/lib/DebugInfo/Symbolize/ProcessContext.cpp:38

+std::optional<std::string> ProcessContext::parseMode(StringRef Str) {
+  if (Str.empty())
----------------
Since it parses a string and returns a string in the same format, I think calling this "normalize" rather than "parse" might make sense.

================
Comment at: llvm/lib/DebugInfo/Symbolize/ProcessContext.cpp:42
+
+  // Pop off each of r/R, w/W, and x/X from the front, in that order.
+  StringRef Remainder = Str;
----------------
Even though the markup spec says they should be in this order and nothing else there, it might be worthwhile to be permissive here about the order and also about accepting something like `-` and/or `.` as ignorable placeholders that don't get flagged as "incorrect format", and maybe even certain extraneous letters like `p`. That would e.g. make this reusable to parse the second column of Linux /proc/PID/maps format. (And whatever the spec says, it seems quite plausible someone might just use that text verbatim when producing markup if that's where they're getting the information.)

================
Comment at: llvm/lib/DebugInfo/Symbolize/ProcessContext.cpp:115
+  json::ObjectMapper O(Value, Path);
+  if (!O || !O.map("id", Result.ID) || !O.map("name", Result.Name))
+    return false;
----------------
There's no point in a module without a module ID, but both name and build ID could be empty in the markup and seems like it might be better in the schema to allow omitting those keys rather than requiring they be present with an empty value.

================
Comment at: llvm/lib/DebugInfo/Symbolize/ProcessContext.cpp:121
+    return false;
+  if (Type != "elf") {
+    Path.field("type").report("unsupported type");
----------------
I wonder whether it's worth enforcing this here. We will one of these days definitely be using markup for PE-COFF objects (in fact in UEFI code we already are and are just lying about the type in the markup we emit), and probably eventually MachO and whatever else too. It's not clear that anything in the JSON schema or even the ProcessContext object will really be different for those non-ELF cases. So I'm not sure whether we should just plan to carefully match all types in parser code (both here and in markup parsing) and perhaps start sharing code for the list of supported types when it expands, or just accept any type in places that don't need to care. (It's not even clear full support would really need to care, since if there's a build ID of a non-ELF form then certain kinds of build ID-indexing might find a file and just have tools grok whatever format that file is when they open it without caring to know what format it's expected to be in. Though it does seem quite plausible that we'd want the debuginfod protocol's name schemas to differ for different types of files, so that could go either way.)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148047/new/

https://reviews.llvm.org/D148047