[PATCH] D137657: [DWARFLibrary] Add support to re-construct cu-index

Thu Nov 10 22:20:00 PST 2022

clayborg added inline comments.

================
Comment at: llvm/lib/DebugInfo/DWARF/DWARFContext.cpp:819-820
+      // to do anything.
+      if (Header.getVersion() == 4 && type == IndexType::TUIndex)
+        break;
+
----------------
ayermolo wrote:
> clayborg wrote:
> > ayermolo wrote:
> > > clayborg wrote:
> > > > ayermolo wrote:
> > > > > dblaikie wrote:
> > > > > > dblaikie wrote:
> > > > > > > ayermolo wrote:
> > > > > > > > dblaikie wrote:
> > > > > > > > > ayermolo wrote:
> > > > > > > > > > dblaikie wrote:
> > > > > > > > > > > I'm a bit more worried about implementing this for DWARFv4 due to needing to rebuild .debug_abbrev sections together, which is less reliable/guaranteed (there's no guarantee that the abbrev contributions are written in the same order as the .debug_info sections - though it's the case in reality I guess) than .debug_info parsing.
> > > > > > > > > > > 
> > > > > > > > > > > Any chance this workaround can be restricted to only DWARFv5?
> > > > > > > > > > I would prefer to keep it more generic and include DWARF4. Part of the reason is that we still heavily rely on DWARF4.
> > > > > > > > > > One thing we can do is that if DWOID is "garbage". As in when we populate CU Index and can't find the signature in a map we stop.
> > > > > > > > > > With the current implementation we can just clear the map and parse CU/TU index as before. If we move implementation of updating contributions entires after parsing we just stop updating further. I think with this approach we can handle the common case of how things are now in reality, and on of chance the assumption fails we are no worse than we were before. Although it does assume that with wrong abbrev. getDWOID doesn't crash.... What do you think?
> > > > > > > > > maybe a way to make this more robust for DWARFv4 would be to parse the index as-written first, and using the index entries to find the suitable .debug_abbrev contribution for the CU (found via the manual walk, associated via the DWO ID)
> > > > > > > > > 
> > > > > > > > > As for the fallback behavior in case of "garbage" DWO IDs - given how tenuous all this already is, perhaps we should stop/error if that happens, rather than producing corrupted data?
> > > > > > > > Sorry, not sure I follow. When we manually walk there is implicit assumption that .debug_info.dwo section and corresponding .debug_abbrev.dwo are in the same order. Thus we can use abbrev to parse unit die to get DWO ID (since this is DWARF4 and it's one of the attributes). Then use DWO ID to figure out which contribution we need to modify.
> > > > > > > > 
> > > > > > > > When we parse Index CUs are in the order in which DWOIDs got hashed into the table. Since .debug_info offsets column is corrupt (at least for some), we still can't map the abbrev section to the corresponding debug section.
> > > > > > > > 
> > > > > > > > The sentiment from users is better to have some/most of debug information than none. Which is why we have dwp with over 4GB of debug info even thought in trunk it's an error in llvm-dwp....
> > > > > > > > 
> > > > > > > Ah, right, right - circular problem. Sorry.
> > > > > > > 
> > > > > > > I'm still a bit worried about the complexity of supporting v4 here... not sure it's the right thing to do.
> > > > > > I guess we could at least validate that the abbrev offset was probably correct - once we have the DWO ID we could look up the index entry, find the abbrev section offset, and confirm it was the one we used?
> > > > > We could. For the case where we use wrong abbrev, interpret some random 8 bytes as DWO ID, and they happen to be same ones as a valid signature?
> > > > Both solutions are a bit messy. We must get the abbrev offset correct somehow. If we parse the CU index first, you can most probably identify the right .debug_abbrev offset. It will be very doubtful that you will have two 32 bit CU offsets that truncate to the same offset. It could happen, but very unlikely. So I would rather we don't assume anything about .debug_abbrev layout and parse the standard 32 bit CU index first, and then to find the .debug_abbrev offset, we find the truncated 32 bit CU offset in the list. 
> > > Not sure I follow. Only algo I could come up with that would do it is:
> > > 
> > > ```
> > > Parse headers and have a map from truncated and normal ones under 4GB to 64bit real offsets. In case of collision
> > > Parse the CU Index, and iterate over it.
> > > As we do it, we find a cu offset in a map 
> > >    get its real 64 bit offsets. 
> > >    Get abbrev offset from cu index
> > >    Until DWO ID matches signature iterate over 64 bit real offsets
> > >        Use it to parse UnitDIE and check
> > >        update cu index with real 64bit offset if check passes
> > > ```
> > > 
> > > *Assumption is that parsing CU with wrong abbrev won't just crash somewhere
> > > 
> > > or did you had something else in mind?
> > >    
> > > 
> > > 
> > yeah, my flow would be:
> > - parse CU index using previous code and let 32 bit CU offsets get truncated
> > - if (sizeof(.debug_info|types) > 4GB) then call this function
> > - parse all CU headers
> >   - find the current CU's .debug_abbrev offset by searching truncated map from first step using a truncated 32 bit offset
> >   - parse the CU's first DIE using the correct abbreviation table
> To confirm the last step so we can verify DWO ID in a UnitDIE is the same as a signature in contribution?
You would only need to do this if you have multiple CUs whose truncated offset is in the parsed CU index. If there is only one match, we are good to go. So maybe a better flow is:

- parse CU index using previous code and let 32 bit CU offsets get truncated
- if (sizeof(.debug_info|types) <= 4GB) then we are done
- parse all CU headers
  - search the CU index for a truncated offset that matches the current CU's truncated 64 bit offset
    - If the there is only one match, just fix the truncated CU offset and continue, no need to read the CU DIE at all for the DWO ID
    - if there are multiple matches, then you need to try and get the the right abbreviation so you can verify the DWO ID. If we have the wrong abbrev table, then we might get decoding errors, or we might not, so this will be the tricky part to get right. You could rely on ordering etc to help try and get the right abbrev table as you were doing, but you only need to do this _if_ you have multiple matches.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137657/new/

https://reviews.llvm.org/D137657