[PATCH] D52300: [clangd] Implement VByte PostingList compression

Fri Sep 21 01:23:18 PDT 2018

kbobyrev added a comment.

@sammccall thank you for the comments, I'll improve it. @ilya-biryukov also provided valuable feedback and suggested that the code looks complicated now. We also discussed different compression approach which would require storing `Head`s and `Payload`s separately so that binary search over `Head`s could have better cache locality. That can dramatically improve performance.

For now, I think it's important to simplify the code and I'll start with that. Also, your suggestions will help to save even more memory! Zeroed bytes are an example of that: I started the patch with the general encoding API (so that zeros could be encoded in the list), but there's no good reason to keep that assumption now.

Also, I think I got my measurements wrong yesterday. I measured posting lists only: without compression size is 55 MB and with compression (current version, not optimized yet) it's 22 MB. This seems like a huge win. I try to keep myself from being overenthusiastic and double-check the numbers, but it looks more like something you estimated when we used posting list size distribution.

================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:262
+             "Unterminated byte sequence at the end of input stream.");
+  return Result;
+}
----------------
sammccall wrote:
> here I think you're missing a memory optimization probably equal in size to the whole gains achieved by compression :-)
> 
> libstdc++ uses a 2x growth factor for std::vector, so we're probably wasting an extra 30% or so of ram (depending on size distribution, I forget the theory here).
> We should shrink to fit. If we were truly desperate we'd iterate over all the numbers and presize the array, but we're probably not.
> 
> I think `return std::vector<DocID>(Result); // no move, shrink-to-fit` will shrink it as you want (note that `shrink_to_fit()` is usually a no-op :-\)
Great catch! I have to be careful with `std::vector`s which are not allocated with their final size in advance.

================
Comment at: clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp:1
+//===-- VByteFuzzer.cpp - Fuzz VByte Posting List encoding ----------------===//
+//
----------------
sammccall wrote:
> For better or worse, adding a fuzzer in the open-source project is pretty high ceremony (CMake stuff, subdirectory, oss-fuzz configuration, following up on bugs).
> 
> I'm not sure the maintenance cost is justified here. Can we just run the fuzzer but not check it in?
OK, I'll leave this here until the patch is accepted for continuous testing, but I won't push it in the final version.

https://reviews.llvm.org/D52300