[PATCH] D52300: [clangd] Implement VByte PostingList compression

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Sep 25 02:49:54 PDT 2018


sammccall added a comment.

Mostly looks good, a few further simplifications...



================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:59
+    CurrentID = std::lower_bound(CurrentID, std::end(DecompressedChunk), ID);
+    // Return if the position was found in current chunk.
+    if (CurrentID != std::end(DecompressedChunk))
----------------
I find "if the position was found" somewhat misleading - even if the ID is not in the chunk we can his this case.

Maybe extract `normalizeCursor()` here, with

```
// If the cursor is at the end of a chunk, place it at the start of the next chunk.
void normalizeCursor() { ... }
```

This can be shared with advance().



================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:116
+  decltype(Chunks)::const_iterator CurrentChunk;
+  llvm::SmallVector<DocID, Chunk::PayloadSize + 1> DecompressedChunk;
+  // Iterator over DecompressedChunk.
----------------
Comment the invariants here, e.g.
```
// If CurrentChunk is valid, then DecompressedChunk is CurrentChunk->decompress()
// and CurrentID is a valid (non-end) iterator into it.
```


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:128
+  // of meaningful bytes divided by Chunk::BitsPerEncodingByte.
+  size_t Width = (sizeof(DocID) * 8 - llvm::countLeadingZeros(Delta) +
+                  Chunk::BitsPerEncodingByte - 1) /
----------------
This gives the wrong answer for zero. So assert Delta != 0?


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:128
+  // of meaningful bytes divided by Chunk::BitsPerEncodingByte.
+  size_t Width = (sizeof(DocID) * 8 - llvm::countLeadingZeros(Delta) +
+                  Chunk::BitsPerEncodingByte - 1) /
----------------
sammccall wrote:
> This gives the wrong answer for zero. So assert Delta != 0?
nit: `int` or `unsigned`, not `size_t`


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:128
+  // of meaningful bytes divided by Chunk::BitsPerEncodingByte.
+  size_t Width = (sizeof(DocID) * 8 - llvm::countLeadingZeros(Delta) +
+                  Chunk::BitsPerEncodingByte - 1) /
----------------
sammccall wrote:
> sammccall wrote:
> > This gives the wrong answer for zero. So assert Delta != 0?
> nit: `int` or `unsigned`, not `size_t`
`Width = 1 + findLastSet(Delta) / 7`


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:138
+                 << Chunk::BitsPerEncodingByte * I;
+    uint8_t Encoding = (Delta & Mask) >> (Chunk::BitsPerEncodingByte * I);
+    bool HasNextByte = I + 1 != Width;
----------------
The mask doesn't need to vary, just apply it after shifting.
But really this loop is much clearer if you modify delta in place.

```
do {
  Encoding = Delta & 0x7f;
  Delta >>= 7;
  Payload.front() = Delta ? Encoding : 0x80 | Encoding;
  Payload = Payload.drop_front();
} while (Delta != 0);
```


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:171
+  llvm::SmallVector<Chunk, 1> Result;
+  std::array<uint8_t, Chunk::PayloadSize> Payload;
+  Payload.fill(0);
----------------
Why not just declare a chunk here (or use Result.back() and work in place)?
```
Result.emplace_back();
DocID Last = Result.back().Head = Documents.front();
MutableArrayRef<uint8_t> RemainingPayload = Result.back().Payload;
for (DocID Doc : Documents.drop_front()) {
  // no need to handle I == 0 special case.
  if (!encodeVByte(Doc - Last, RemainingPayload)) { // didn't fit, flush chunk
    Result.emplace_back();
    Result.back().Head = Doc;
    RemainingPayload = Result.back().Payload;
  }
  Last = Doc;
}
```

more values, fewer indices


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:173
+  Payload.fill(0);
+  llvm::MutableArrayRef<uint8_t> PayloadRef(Payload);
+  size_t HeadIndex = 0;
----------------
`PayloadRef` doesn't really describe the function of this variable. Suggest `RemainingPayload` or `EmptyPayload` or so


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:192
+/// Reads variable length DocID from the buffer and updates the buffer size. If
+/// the stream is terminated, return None.
+llvm::Optional<DocID> readVByte(llvm::ArrayRef<uint8_t> &Bytes) {
----------------
nit: if the stream is terminated, **consumes all bytes and** returns None.


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:41
+/// decompressed upon request.
+struct Chunk {
+  // Keep sizeof(Chunk) == 32.
----------------
sammccall wrote:
> With the current implementation, this doesn't need to be in the header.
> (the layout of vector<chunk> doesn't depend on chunk, you should just need to out-line the default destructor)
> 
> (using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)
(meta-nit: please don't mark comments as done if they're not done - rather explain why you didn't do them!)


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:41
+/// decompressed upon request.
+struct Chunk {
+  // Keep sizeof(Chunk) == 32.
----------------
sammccall wrote:
> sammccall wrote:
> > With the current implementation, this doesn't need to be in the header.
> > (the layout of vector<chunk> doesn't depend on chunk, you should just need to out-line the default destructor)
> > 
> > (using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)
> (meta-nit: please don't mark comments as done if they're not done - rather explain why you didn't do them!)
> (using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)

You changed this to SmallVector - what were the measurements?
(SmallVector is going to be bigger than Vector whenever you overrun, so it's worth checking)


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:39
 
+/// Chunk is a fixed-width piece of PostingList which contains the first DocID
+/// in uncompressed format (Head) and delta-encoded Payload. It can be
----------------
mark as an implementation detail so readers aren't confused.


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:46
+  /// Number of DocID bits in each encoding byte.
+  static constexpr size_t BitsPerEncodingByte = 7;
+
----------------
this is an implementation detail, move to cpp file (or just inline)


================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.h:50
+
+  /// The first element of
+  DocID Head;
----------------
??


https://reviews.llvm.org/D52300





More information about the cfe-commits mailing list