[PATCH] D87805: [PDB] Merge types in parallel when using ghashing
Alexandre Ganea via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Sep 18 13:49:12 PDT 2020
aganea added a comment.
Thanks a lot for working on this Reid, really neat!
This is a huge change, how do you validate that the resulting PDB is equivalent to that of the previous version (before patch) or non-ghash merging?
================
Comment at: lld/COFF/DebugTypes.cpp:598
+ // Copy the type into our mutable buffer.
+ assert(ty.length() <= codeview::MaxRecordLength);
+ size_t offset = merged.recs.size();
----------------
Fun fact, microsoft-pdb does (mistakenly?) `<` not `<=`: https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/tpi.cpp#L1130
However it does reserve `cbRecMax` bytes (L942).
================
Comment at: lld/COFF/DebugTypes.cpp:609
+ for (size_t i = ty.length(); i < newSize; ++i)
+ newRec[i] = LF_PAD0 + (newSize - i);
+ }
----------------
We're doing this all over the place, it'd be nice to eventually converge all variants (later).
================
Comment at: lld/COFF/DebugTypes.cpp:807
+namespace {
+/// A concurrent hash table for global type hashing. It is based on this paper:
+/// Concurrent Hash Tables: Fast and General(?)!
----------------
I am wondering if this doesn't belong in a new file? Since the code is quite small, we //could// possibly have different implementations (in the future), depending on the dataset. 32-bit, 64-bit or 128-bit with no ghash indirection (if the CPU supports 128-bit CAS).
================
Comment at: lld/COFF/DebugTypes.cpp:873
+ GloballyHashedType getGHash() const {
+ return TpiSource::instances[getTpiSrcIdx()]->ghashes[getGHashIdx()];
+ }
----------------
I can't help thinking that this smells like Clang's SourceManager index, where all sources all collated into a single index (not a 2D array).
If you did that, it would reduce the size of the cell data to 32-bit, iff we limit ourselves to 2^32 input records.
Am I being too optimistic? ;-)
================
Comment at: lld/COFF/DebugTypes.cpp:932
+ // Advance the probe. Wrap around to the beginning if we run off the end.
+ ++idx;
+ idx = idx == tableSize ? 0 : idx;
----------------
It'd be interesting to collect statistics on how many collisions you get. And also compare linear (current behavior) vs. quadratic probing.
One issue I can see is that since the table will be 99.9% full at the end of the insertion pass, there will lots of collisions toward the end. What about making the table 25% bigger, like DenseHash does?
================
Comment at: lld/COFF/DebugTypes.cpp:974
+ // Cap the table size so that we can use 32-bit cell indices. Type indices are
+ // also 32-bit, so this is an inherent PDB file format limit anyway.
+ tableSize = std::min(size_t(INT32_MAX), tableSize);
----------------
Looks like the limit in the PDB is 28-bit wide indices, probably because the PDB limit is 4 GB and because the smallest type record cannot be less that 8 bytes (4-byte header + 1 byte payload + padding).
https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/dbiimpl.h#L62
In practice, I never saw more that a few tens of millions of type records in a 2-GB PDB. It is very unlikely that we'll ever reach this 28-bit limit.
However in this case you're talking about the cumulative (input) records count, right? That can be pretty large, I've seen 1 billion input type records (when we link our games without Unity/Jumbo files).
How many input type records do you see on your largest EXE/DLL? (we could add the total input type records count to `/summary`)
================
Comment at: lld/COFF/DebugTypes.cpp:1011
+ // - source 1, type 0...
+ std::vector<GHashCell> entries;
+ for (const GHashCell &cell :
----------------
.reserve?
================
Comment at: lld/COFF/DebugTypes.cpp:1093
+}
+;
+ }
----------------
Remove ;
================
Comment at: lld/COFF/PDB.cpp:898
+ // already been merged.
+ if (!config->debugGHashes) {
+ ScopedTimer t(typeMergingTimer);
----------------
After this patch, `/DEBUG:GHASH` could become the new default?
================
Comment at: lld/test/COFF/pdb-procid-remapping.test:1
-# RUN: yaml2obj %p/Inputs/pdb1.yaml -o %t1.obj
-# RUN: yaml2obj %p/Inputs/pdb2.yaml -o %t2.obj
+# RUN: yaml2obj < %p/Inputs/pdb1.yaml > %t1.obj
+# RUN: yaml2obj < %p/Inputs/pdb2.yaml > %t2.obj
----------------
I know @MaskRay recently changed all < to cmd-line input and > to -o. Do you need < > here?
================
Comment at: llvm/lib/DebugInfo/PDB/Native/TpiStreamBuilder.cpp:70
- TypeRecords.push_back(Record);
+ TypeRecBuffers.push_back(Record);
+ // FIXME: Require it.
----------------
Since you probably already know how many records/hashes you're inserting, can you `.reserve()` `TypeRecBuffers` and `TypeHashes` in advance?
================
Comment at: llvm/lib/DebugInfo/PDB/Native/TpiStreamBuilder.cpp:93
+
+ TypeRecBuffers.push_back(Types);
+ TypeHashes.insert(TypeHashes.end(), Hashes.begin(), Hashes.end());
----------------
Same here (.reserve).
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D87805/new/
https://reviews.llvm.org/D87805
More information about the llvm-commits
mailing list