[PATCH] D54554: [PDB] Add symbol records in bulk

Reid Kleckner via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 19 13:56:58 PST 2018


rnk added inline comments.


================
Comment at: lld/COFF/PDB.cpp:1078
+  ArrayRef<uint8_t> BulkSymbols;
+  cantFail(forEachCodeViewRecord<CVSymbol>(
+      SymsBuffer, [&](CVSymbol Sym) -> llvm::Error {
----------------
zturner wrote:
> rnk wrote:
> > zturner wrote:
> > > rnk wrote:
> > > > zturner wrote:
> > > > > I feel like this is doable in fewer than 2 passes.
> > > > > 
> > > > > What if we kept a vector of offsets into the `SymData` buffer, and the offsets would represent the position of records which required an alignment adjustment.  If you get through the loop and the vector is empty, then that is the same as `NeedsAlignment == false`.
> > > > > 
> > > > > As you iterate the records, you remap the type indices in the original buffer, which may still require alignment, but you don't actually do any re-alignment yet.  For each record that does require realignment, you push its offset onto the vector.  Note that you can collapse adjacent alignments: for example, if Record 1 is at offset 3 mod 4 and Record 2 is also at offset 3 mod 4, then only 1 re-alignment is required, because as soon as you realign the output buffer to 0 mod 4, both records will be re-aligned.
> > > > > 
> > > > > So you're basically pushing offsets onto the vector when offset % 4 changes.
> > > > > 
> > > > > At the end of this process, all remapping has been done, but no re-alignment has occurred.
> > > > > 
> > > > > If the offset vector is empty, you're done.  Just return, and you only made 1 pass over the records.
> > > > > 
> > > > > If the offset vector is not empty, you walk it, and at each offset you insert the few bytes of padding needed to get back to 0 mod 4, then copy in bulk up to the next offset.
> > > > > 
> > > > > Now, in the best case scenario, you only have 1 pass over the inputs, but in the worst case scenario, you don't have O(2n), you have O(n+m) where m < n.
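> > > > > 
> > > > > Rough sketch of what I mean, in the context of the snippet above (`remapInPlace` and the `AlignedSyms` buffer helpers are placeholders):
> > > > > ```
> > > > > std::vector<uint32_t> PadAt;   // input offsets that need a padding fixup
> > > > > uint32_t InOff = 0, Shift = 0; // Shift = total padding inserted so far
> > > > > cantFail(forEachCodeViewRecord<CVSymbol>(
> > > > >     SymsBuffer, [&](CVSymbol Sym) -> llvm::Error {
> > > > >       remapInPlace(Sym); // pass 1: remap type indices, no copying yet
> > > > >       // Push an offset only when the 0 mod 4 phase is off, so runs of
> > > > >       // equally-misaligned records collapse into a single fixup.
> > > > >       if ((InOff + Shift) % 4 != 0) {
> > > > >         PadAt.push_back(InOff);
> > > > >         Shift += 4 - (InOff + Shift) % 4;
> > > > >       }
> > > > >       InOff += Sym.length();
> > > > >       return llvm::Error::success();
> > > > >     }));
> > > > > if (PadAt.empty())
> > > > >   return; // best case: 1 pass over the records, no re-alignment at all
> > > > > // Walk the (hopefully short) fixup list: bulk-copy each chunk, then pad
> > > > > // so the next record lands back on 0 mod 4.
> > > > > uint32_t Prev = 0;
> > > > > for (uint32_t Off : PadAt) {
> > > > >   AlignedSyms.append(SymsBuffer.slice(Prev, Off - Prev));
> > > > >   AlignedSyms.padToAlignment(4);
> > > > >   Prev = Off;
> > > > > }
> > > > > AlignedSyms.append(SymsBuffer.slice(Prev));
> > > > > ```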
> > > > I like the idea of remapping in place in the first pass and doing the realignment as an optional second pass; that's a good improvement. This is also similar to what I described below, starting with "My plan". There are basically two cases:
> > > > 1. every record is aligned (not implemented yet)
> > > > 2. most records are unaligned
> > > > Your proposal still memcpys every unaligned record in a second pass, but it keeps a temporary vector on the side to hold the offsets. I don't think the vector is needed; we can just iterate again with `forEachCodeViewRecord`.
> > > > 
> > > > Basically, the plan is to combine the two loops that we have here, and do a second pass to realign. The drawback is that we have to "filter" records into global and module streams during the second pass if it runs, which means we're doing "wasted" work on the first pass if the object file isn't prealigned.
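> > > > 
> > > > Something like this (a sketch; `remapInPlace` and the `AlignedSyms` helpers are placeholders, and the global/module filtering is elided):
> > > > ```
> > > > // Pass 1: remap type indices in place, noting whether anything is misaligned.
> > > > bool NeedsRealignment = false;
> > > > uint32_t Off = 0;
> > > > cantFail(forEachCodeViewRecord<CVSymbol>(
> > > >     SymsBuffer, [&](CVSymbol Sym) -> llvm::Error {
> > > >       remapInPlace(Sym);
> > > >       NeedsRealignment |= (Off % 4 != 0);
> > > >       Off += Sym.length();
> > > >       return llvm::Error::success();
> > > >     }));
> > > > // Pass 2, only if needed: re-walk the remapped buffer and copy each
> > > > // record out to an aligned position, with no side vector.
> > > > if (NeedsRealignment)
> > > >   cantFail(forEachCodeViewRecord<CVSymbol>(
> > > >       SymsBuffer, [&](CVSymbol Sym) -> llvm::Error {
> > > >         AlignedSyms.append(Sym.data());
> > > >         AlignedSyms.padToAlignment(4);
> > > >         return llvm::Error::success();
> > > >       }));
> > > > ```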
> > > > 
> > > > I don't want to implement this optimization in this patch. I'd rather combine it with an LLVM change to pre-align the records, do what we've described here, and benchmark that by itself to see if it's worth the complexity.
> > > > 
> > > > What we have here with the up-front pass is probably fastest for unaligned symbols, so IMO we should stick with it for now.
> > > Well, the whole point was to avoid calling `forEachCodeViewRecord` twice, because a) it's slow, since you can't know where record N+1 is until you've parsed record N, and b) there are more CodeView records than there are unaligned records.  You could imagine a `.debug$S` section with, say, 100 records in it, only 1 of which is unaligned.  So if you call `forEachCodeViewRecord` twice, that's 200 iterations (iterate all 100, then iterate all 100 again), whereas with my proposal it's at most 103 (iterate all 100 and do remappings, memcpy the first N records, memcpy the N+1'th record after alignment, memcpy the last 100-N-1 records).
> > > a) it's slow since you can't know where record N+1 is until you've parsed record N, and
> > 
> > Unless we start doing things in parallel, I don't think that matters, except for the benefit you mentioned in 'b'.
> > 
> > > b) there are more CodeView records than there are unaligned records.
> > 
> > Sure, but by some (probably low) constant factor. For example, variables are described by S_LOCAL/S_DEFRANGE pairs, and an S_LOCAL is a fixed-size header followed by a null-terminated name. I'd argue local variable name lengths are effectively random mod 4, so 75% of the time an S_LOCAL is going to need realignment, and S_LOCALs are probably the majority of symbol records in a debug build.
> > 
> > So, the vector of unaligned records will still be O(#symrecs), so we might as well save memory and iterate from the top.
> > 
> > And, to realign, we either have to copy every byte of symbol data to make it contiguous, or the data ends up discontiguous, with lots of holes. What I noticed is that the average symbol record is really small, so it's better to do the copy and point to the entire `.debug$S` section in bulk.
> > 
> > Basically, I don't think there's anything inherently slow about `forEachCodeViewRecord`.
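> > 
> > In other words, roughly the shape of this patch (simplified; `copyAndAlignSymbols` is a stand-in for the copy/realign loop, not the literal code):
> > ```
> > // Realign everything into one contiguous buffer up front, then hand the
> > // whole range to the module builder in a single call instead of one
> > // addSymbol() call per record.
> > ArrayRef<uint8_t> BulkSymbols = copyAndAlignSymbols(SymsBuffer);
> > File->ModuleDBI->addSymbolsInBulk(BulkSymbols);
> > ```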
> It isn't so much about `forEachCodeViewRecord` being slow; it's more that doing 1 large memcpy is faster than doing many small memcpys.  So if you can batch the memcpy calls into the smallest possible set of calls that is both necessary and sufficient, I think it will give you a good performance gain, and that looking at it in terms of O(2n) being "equivalent" to O(n) is going to be misleading.
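> 
> As a toy illustration of the batching, standalone and with a made-up `Chunk` type:
> ```
> #include <cstddef>
> #include <cstring>
> #include <vector>
> 
> // A maximal run of records that needs no alignment fixup.
> struct Chunk { size_t InOff, OutOff, Len; };
> 
> // One memcpy per run instead of one per record: the fewer the fixups,
> // the longer the runs and the fewer (and larger) the calls.
> void bulkCopy(const unsigned char *Src, unsigned char *Dst,
>               const std::vector<Chunk> &Runs) {
>   for (const Chunk &R : Runs)
>     std::memcpy(Dst + R.OutOff, Src + R.InOff, R.Len);
> }
> ```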
> 
> That said, I'm ok with deferring that optimization to a later patch, but I do think it's worth trying, and probably doesn't add much (if any) complexity.
Fair enough; saying something is a "constant" factor is always pretty lame. 2x faster is only a "constant" of "2" faster, but it means a lot in reality. We'll have to consider it in the next round of optimization.


https://reviews.llvm.org/D54554
