[PATCH] D152159: [AppleAccelTable][NFC] Refactor iterator class

Tue Jun 6 08:27:18 PDT 2023

aprantl added inline comments.

================
Comment at: llvm/lib/DebugInfo/DWARF/DWARFAcceleratorTable.cpp:279
+        FormValue.extractValue(Table.AccelSection, Offset, Table.FormParams);
+    assert(ExtractWorked && "Failed to extract value from hash data entry");
+  }
----------------
fdeazeve wrote:
> dblaikie wrote:
> > fdeazeve wrote:
> > > dblaikie wrote:
> > > > fdeazeve wrote:
> > > > > aprantl wrote:
> > > > > > What does it mean for `ExtractWorked` to be false?
> > > > > > If this can be triggered by malformed input, even an asserts-enabled parser needs to handle it gracefully. Asserts should only be used for internal consistency checks that may never fail.
> > > > > >If this can be triggered by malformed input, even an asserts-enabled parser needs to handle it gracefully. 
> > > > > 
> > > > > That's what I mean in the commit message with:
> > > > > 
> > > > > > We maintain the already existing assumption that failures never happen, but now
> > > > > > make it explicit with an assert.
> > > > > 
> > > > > Today, if we fail to extract, we're going to start producing completely incorrect data and be silent about it. Imagine one of the iteration fails: the offset doesn't get updated and the next iteration is going to re-use the offset of the previous iteration.
> > > > > 
> > > > > IMO we're not handling it gracefully today, that's why I added the assert. But I don't feel strongly about this (just a bit :) )
> > > > > 
> > > > > I think the only way to fix this correctly is to remove the iterator abstraction and use callbacks instead. Non-end iterators should never fail the dereference operator, but data extracts are inherently a "fail-able" abstraction.
> > > > I believe @lhames and I spoke about fallible iterators previously & he's used some iterator abstractions with fallibility (which can be nicer than a callback - for instance, allowing early exit or other novel iteration sequences that a callback API might not provide in a general way) - by having the iterator terminate early (when it hits an error, it becomes == end) and having a member you have to inspect after the fact to query for the error state/if it's failed.
> > > >  and having a member you have to inspect after the fact to query for the error state/if it's failed.
> > > 
> > > That's an interesting idea, but how would that work with range-based for loops, as they don't expose the iterator itself to the programmer?
> > > 
> > > Are you aware of examples of such iterators inside LLVM?
> > > 
> > > > having the iterator terminate early 
> > > 
> > > Note how this also forces the iterator to know how to set itself to the end iterator, which is ok but requires more bookkeeping.
> > > 
> > > 
> > > One other possibility we could try is to have the iterator's `operator *` return an optional. However, the current implementation of these iterators (both before and after this patch) avoid memory operations by keeping an `Entry` inside the iterator; the optional would inevitably have copy that. Unless we use pointers as optionals? (i.e. `operator*` returns an `Entry*`) 
> > > 
> > > While what I'm saying applies only for apple_names, the standardized `debug_names` implementation seems to follow a similar pattern, though the code there is even more complex and I haven't yet been able to digest it.
> > > 
> > > Another thing we could do is to eagerly read all the data (except maybe for the strings) from the section and build a proper data structure with the FormValues and hashes and offsets already parsed. So we never have a chance to error again after the constructor is finished. That said, I suspect the original implementors didn't do this for performance / memory reasons? It's interesting to consider how some queries from LLDB force us to search/materialize all of the table anyway. For example, by search by regex (gotta read all the strings) or by DIE whose offset is inside a given range (gotta read all the HashData to find the DIE offset)
> > Oh, actually - maybe we settled on a different idiom - an out parameter from the range accessor, that's compatible with range-for loops, it's even documented: https://llvm.org/docs/ProgrammersManual.html#building-fallible-iterators-and-iterator-ranges
> > 
> > & yeah, certainly one of Lang's goals with libObject was to make it as lazy as possible so as not to do any extra work/hurt performance but also to be as usable as possible on potentially invalid object files - be able to parse all the parts that can be parsed, without tripping over on the first failure and being unable to continue.
> > 
> > The LLVM libDebugInfoDWARF code would probably benefit from similar goals - same idea that it might be used to investigate invalid inputs, and being able to keep going as much as possible/not trip over unrelated errors would be a valuable feature.
> Oh very interesting, I think we can make this work on top of the cleanups of this patch, so I will give it a try as a follow-up patch. Thank you for the references!
> 
> >  one of Lang's goals with libObject was to make it as lazy as possible so as not to do any extra work/hurt performance
> 
> Maybe I'm being naive here, but it's not clear to me whether being lazy (for the bucket/hash/offset/string relocation resolving) is the right call, since we don't cache anything for future use. Granted, if we do `<= 1` queries that visit the entire table, then it is a win. But the moment we do the second query, it would have been faster to rely on a pre-processed data structure, as these data extractors are heavy-weight.
With LLDB as the consumer in mind there are two scenarios we want to optimize for:

1. debugging with a Mach-O debug map: Traversing through thousands of .o files when looking up a DIE by name, most of the time getting a "no" result

2. lookups in a single, very large accelerator table (.dSYMs, ELF binaries, ...)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152159/new/

https://reviews.llvm.org/D152159