[PATCH] D124422: [Serialization] Improve encoding of small SourceRanges.

Tue Apr 26 03:28:06 PDT 2022

sammccall added a comment.

In D124422#3474110 <https://reviews.llvm.org/D124422#3474110>, @ilya-biryukov wrote:

> Yeah, that sounds promising. Various source locations inside the same nodes are probably very close. Probably even across nodes.

Yeah. The reason to avoid compressing across nodes is it breaks lazy-loading. (Also I'm not sure whether related decls end up proximate in the output). Maybe there's a way around it, but this is much lower-hanging I think.

> If we are willing to take on a bit more complexity, we could probably get even bigger wins by storing all source locations in a single array  having indices into that array in the actual positions where we need the source locations.

This is an interesting idea, I wonder whether it will be a win.

The delta-encodings are efficient as you note, but we have to represent SourceLocations twice. (value + index).
I think the #bits we save storing indices vs SourceLocations is similar to the #bits we need to encode the deltas.
If these cancel out then the savings we get are from deduplicating SourceLocations in the array - I suspect there aren't actually that many duplicates, we'd be lucky to halve the array size.

Rough example with made-up numbers:

- today we store raw SourceLocations, N * ~28 bits
- maybe simple delta-encoding maybe reduces this by 8 to N * 20 bits average
- sorted SourceLocation array (no delta-encoding, no deduplication) is N * 28, indices maybe N * 23 (assuming distance of 2^5 between sourcelocations)
- delta-encoding the array means it takes conservatively N * 7 bits (5 bits average delta + VBR overhead + variance)
- delta-encoding the indices yields the same 8 bits as before => N * 15
- deduplcating the halves the entries and reduces the size of indices by 1 bit
- this yields N/2*7 for the array + N * 14 for indices = N * 17.5

So we come out ahead (20->17.5) if deduplication halves #locations. If it reduces by a third, it's 20->19 instead.
This seems likely to be an improvement, over simple delta, but much smaller than delta vs status quo.

> I suspect there are requirements that will force us to store the array in chunks and allow searching for a particular chunk.
>
> However, that's really adding quite a bit of complexity and I'm not sure preamble sizes are such a big deal outside our environment. E.g. even 10% savings are borderline for that kind of complexity.

Yeah, I don't plan to pursue this for now because the complexity/reward scares me, but if we get really squeezed for size we could revisit.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124422/new/

https://reviews.llvm.org/D124422