[PATCH] D52300: [clangd] Implement VByte PostingList compression
Sam McCall via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Tue Sep 25 02:49:54 PDT 2018
sammccall added a comment.
Mostly looks good, a few further simplifications...
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:59
+ CurrentID = std::lower_bound(CurrentID, std::end(DecompressedChunk), ID);
+ // Return if the position was found in current chunk.
+ if (CurrentID != std::end(DecompressedChunk))
----------------
I find "if the position was found" somewhat misleading - even if the ID is not in the chunk we can his this case.
Maybe extract `normalizeCursor()` here, with
```
// If the cursor is at the end of a chunk, place it at the start of the next chunk.
void normalizeCursor() { ... }
```
This can be shared with advance().
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:116
+ decltype(Chunks)::const_iterator CurrentChunk;
+ llvm::SmallVector<DocID, Chunk::PayloadSize + 1> DecompressedChunk;
+ // Iterator over DecompressedChunk.
----------------
Comment the invariants here, e.g.
```
// If CurrentChunk is valid, then DecompressedChunk is CurrentChunk->decompress()
// and CurrentID is a valid (non-end) iterator into it.
```
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:128
+ // of meaningful bytes divided by Chunk::BitsPerEncodingByte.
+ size_t Width = (sizeof(DocID) * 8 - llvm::countLeadingZeros(Delta) +
+ Chunk::BitsPerEncodingByte - 1) /
----------------
This gives the wrong answer for zero. So assert Delta != 0?
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:128
+ // of meaningful bytes divided by Chunk::BitsPerEncodingByte.
+ size_t Width = (sizeof(DocID) * 8 - llvm::countLeadingZeros(Delta) +
+ Chunk::BitsPerEncodingByte - 1) /
----------------
sammccall wrote:
> This gives the wrong answer for zero. So assert Delta != 0?
nit: `int` or `unsigned`, not `size_t`
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:128
+ // of meaningful bytes divided by Chunk::BitsPerEncodingByte.
+ size_t Width = (sizeof(DocID) * 8 - llvm::countLeadingZeros(Delta) +
+ Chunk::BitsPerEncodingByte - 1) /
----------------
sammccall wrote:
> sammccall wrote:
> > This gives the wrong answer for zero. So assert Delta != 0?
> nit: `int` or `unsigned`, not `size_t`
`Width = 1 + findLastSet(Delta) / 7`
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:138
+ << Chunk::BitsPerEncodingByte * I;
+ uint8_t Encoding = (Delta & Mask) >> (Chunk::BitsPerEncodingByte * I);
+ bool HasNextByte = I + 1 != Width;
----------------
The mask doesn't need to vary, just apply it after shifting.
But really this loop is much clearer if you modify delta in place.
```
do {
Encoding = Delta & 0x7f;
Delta >>= 7;
Payload.front() = Delta ? Encoding : 0x80 | Encoding;
Payload = Payload.drop_front();
} while (Delta != 0);
```
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:171
+ llvm::SmallVector<Chunk, 1> Result;
+ std::array<uint8_t, Chunk::PayloadSize> Payload;
+ Payload.fill(0);
----------------
Why not just declare a chunk here (or use Result.back() and work in place)?
```
Result.emplace_back();
DocID Last = Result.back().Head = Documents.front();
MutableArrayRef<uint8_t> RemainingPayload = Result.back().Payload;
for (DocID Doc : Documents.drop_front()) {
// no need to handle I == 0 special case.
if (!encodeVByte(Doc - Last, RemainingPayload)) { // didn't fit, flush chunk
Result.emplace_back();
Result.back().Head = Doc;
RemainingPayload = Result.back().Payload;
}
Last = Doc;
}
```
more values, fewer indices
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:173
+ Payload.fill(0);
+ llvm::MutableArrayRef<uint8_t> PayloadRef(Payload);
+ size_t HeadIndex = 0;
----------------
`PayloadRef` doesn't really describe the function of this variable. Suggest `RemainingPayload` or `EmptyPayload` or so
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:192
+/// Reads variable length DocID from the buffer and updates the buffer size. If
+/// the stream is terminated, return None.
+llvm::Optional<DocID> readVByte(llvm::ArrayRef<uint8_t> &Bytes) {
----------------
nit: if the stream is terminated, **consumes all bytes and** returns None.
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:41
+/// decompressed upon request.
+struct Chunk {
+ // Keep sizeof(Chunk) == 32.
----------------
sammccall wrote:
> With the current implementation, this doesn't need to be in the header.
> (the layout of vector<chunk> doesn't depend on chunk, you should just need to out-line the default destructor)
>
> (using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)
(meta-nit: please don't mark comments as done if they're not done - rather explain why you didn't do them!)
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:41
+/// decompressed upon request.
+struct Chunk {
+ // Keep sizeof(Chunk) == 32.
----------------
sammccall wrote:
> sammccall wrote:
> > With the current implementation, this doesn't need to be in the header.
> > (the layout of vector<chunk> doesn't depend on chunk, you should just need to out-line the default destructor)
> >
> > (using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)
> (meta-nit: please don't mark comments as done if they're not done - rather explain why you didn't do them!)
> (using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)
You changed this to SmallVector - what were the measurements?
(SmallVector is going to be bigger than Vector whenever you overrun, so it's worth checking)
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:39
+/// Chunk is a fixed-width piece of PostingList which contains the first DocID
+/// in uncompressed format (Head) and delta-encoded Payload. It can be
----------------
mark as an implementation detail so readers aren't confused.
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:46
+ /// Number of DocID bits in each encoding byte.
+ static constexpr size_t BitsPerEncodingByte = 7;
+
----------------
this is an implementation detail, move to cpp file (or just inline)
================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:50
+
+ /// The first element of
+ DocID Head;
----------------
??
https://reviews.llvm.org/D52300
More information about the cfe-commits
mailing list