[PATCH] D137657: [DWARFLibrary] Add support to re-construct cu-index

Thu Nov 10 18:30:01 PST 2022

ayermolo added a comment.

@dblaikie How would you like to proceed?

================
Comment at: llvm/lib/DebugInfo/DWARF/DWARFContext.cpp:819-820
+      // to do anything.
+      if (Header.getVersion() == 4 && type == IndexType::TUIndex)
+        break;
+
----------------
clayborg wrote:
> ayermolo wrote:
> > dblaikie wrote:
> > > dblaikie wrote:
> > > > ayermolo wrote:
> > > > > dblaikie wrote:
> > > > > > ayermolo wrote:
> > > > > > > dblaikie wrote:
> > > > > > > > I'm a bit more worried about implementing this for DWARFv4 due to needing to rebuild .debug_abbrev sections together, which is less reliable/guaranteed (there's no guarantee that the abbrev contributions are written in the same order as the .debug_info sections - though it's the case in reality I guess) than .debug_info parsing.
> > > > > > > > 
> > > > > > > > Any chance this workaround can be restricted to only DWARFv5?
> > > > > > > I would prefer to keep it more generic and include DWARF4. Part of the reason is that we still heavily rely on DWARF4.
> > > > > > > One thing we can do is that if DWOID is "garbage". As in when we populate CU Index and can't find the signature in a map we stop.
> > > > > > > With the current implementation we can just clear the map and parse CU/TU index as before. If we move implementation of updating contributions entires after parsing we just stop updating further. I think with this approach we can handle the common case of how things are now in reality, and on of chance the assumption fails we are no worse than we were before. Although it does assume that with wrong abbrev. getDWOID doesn't crash.... What do you think?
> > > > > > maybe a way to make this more robust for DWARFv4 would be to parse the index as-written first, and using the index entries to find the suitable .debug_abbrev contribution for the CU (found via the manual walk, associated via the DWO ID)
> > > > > > 
> > > > > > As for the fallback behavior in case of "garbage" DWO IDs - given how tenuous all this already is, perhaps we should stop/error if that happens, rather than producing corrupted data?
> > > > > Sorry, not sure I follow. When we manually walk there is implicit assumption that .debug_info.dwo section and corresponding .debug_abbrev.dwo are in the same order. Thus we can use abbrev to parse unit die to get DWO ID (since this is DWARF4 and it's one of the attributes). Then use DWO ID to figure out which contribution we need to modify.
> > > > > 
> > > > > When we parse Index CUs are in the order in which DWOIDs got hashed into the table. Since .debug_info offsets column is corrupt (at least for some), we still can't map the abbrev section to the corresponding debug section.
> > > > > 
> > > > > The sentiment from users is better to have some/most of debug information than none. Which is why we have dwp with over 4GB of debug info even thought in trunk it's an error in llvm-dwp....
> > > > > 
> > > > Ah, right, right - circular problem. Sorry.
> > > > 
> > > > I'm still a bit worried about the complexity of supporting v4 here... not sure it's the right thing to do.
> > > I guess we could at least validate that the abbrev offset was probably correct - once we have the DWO ID we could look up the index entry, find the abbrev section offset, and confirm it was the one we used?
> > We could. For the case where we use wrong abbrev, interpret some random 8 bytes as DWO ID, and they happen to be same ones as a valid signature?
> Both solutions are a bit messy. We must get the abbrev offset correct somehow. If we parse the CU index first, you can most probably identify the right .debug_abbrev offset. It will be very doubtful that you will have two 32 bit CU offsets that truncate to the same offset. It could happen, but very unlikely. So I would rather we don't assume anything about .debug_abbrev layout and parse the standard 32 bit CU index first, and then to find the .debug_abbrev offset, we find the truncated 32 bit CU offset in the list. 
Not sure I follow. Only algo I could come up with that would do it is:

```
Parse headers and have a map from truncated and normal ones under 4GB to 64bit real offsets. In case of collision
Parse the CU Index, and iterate over it.
As we do it, we find a cu offset in a map 
   get its real 64 bit offsets. 
   Get abbrev offset from cu index
   Until DWO ID matches signature iterate over 64 bit real offsets
       Use it to parse UnitDIE and check
       update cu index with real 64bit offset if check passes
```

*Assumption is that parsing CU with wrong abbrev won't just crash somewhere

or did you had something else in mind?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137657/new/

https://reviews.llvm.org/D137657