[LLVMdev] RFC: Binary format for instrumentation based profiling data

Mon Mar 24 15:26:27 PDT 2014

On Mar 24, 2014, at 12:29 PM, Chandler Carruth <chandlerc at google.com> wrote:

> Format 2
> --------
> 
> This format should be efficient to read and preferably reasonably
> compact. We'll convert from format 1 to format 2 using llvm-profdata,
> and clang will use format 2 for PGO.
> 
> Since the only particularly important operation in this use case is fast
> lookup, I propose using the on disk hash table that's currently used in
> clang for AST serialization/PTH/etc with a small amount of metadata in a
> header.
> 
> The hash table implementation currently lives in include/clang/Basic and
> consists of a single header. Moving it to llvm and updating the clients
> in clang should be easy. I'll send a brief RFC separately to see if
> anyone's opposed to moving it.
> 
> I can raise this on the other thread if you would rather discuss it there, but I'm not a huge fan of this code. My vague memory is that it was a quick hack by Doug that he never really expected to live long-term.
> 
> I have a general preference for from-disk lookups to use tries (for strings, prefix tries) or other fast, sorted lookup structures.

These profiles will contain every function in a program, but relatively few of those functions will be needed per translation unit (per invocation of clang).  I suspect that an on-disk hash table will perform better than a trie for this use case, since each lookup requires fewer loads from disk.
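
As a rough illustration of the "fewer loads" argument, here is a minimal sketch of a hash-table lookup against a file. The layout (a bucket count, a table of bucket offsets, then the buckets), the djb2-style hash, and the names `stable_hash`, `write_table` and `lookup` are all assumptions made up for this example; this is not the layout of clang's existing on-disk hash table.

```python
import io
import struct


def stable_hash(name: bytes) -> int:
    # djb2-style string hash truncated to 32 bits (an arbitrary choice here).
    h = 5381
    for b in name:
        h = (h * 33 + b) & 0xFFFFFFFF
    return h


def write_table(counts: dict, num_buckets: int = 8) -> bytes:
    # Hypothetical layout:
    #   u32 num_buckets, u32 offset[num_buckets], then one record per bucket:
    #   u16 entry_count, then entries of (u16 key_len, u16 num_values, key, u64 values...)
    buckets = [[] for _ in range(num_buckets)]
    for name, values in counts.items():
        key = name.encode()
        buckets[stable_hash(key) % num_buckets].append((key, values))

    header_size = 4 + 4 * num_buckets
    payload = bytearray()
    offsets = []
    for bucket in buckets:
        offsets.append(header_size + len(payload))
        payload += struct.pack("<H", len(bucket))
        for key, values in bucket:
            payload += struct.pack("<HH", len(key), len(values))
            payload += key
            payload += struct.pack(f"<{len(values)}Q", *values)
    return struct.pack(f"<I{num_buckets}I", num_buckets, *offsets) + bytes(payload)


def lookup(f, name: str):
    # Read the bucket count, then one bucket offset, then scan that bucket only.
    key = name.encode()
    f.seek(0)
    (num_buckets,) = struct.unpack("<I", f.read(4))
    f.seek(4 + 4 * (stable_hash(key) % num_buckets))
    (offset,) = struct.unpack("<I", f.read(4))
    f.seek(offset)
    (entry_count,) = struct.unpack("<H", f.read(2))
    for _ in range(entry_count):
        key_len, num_values = struct.unpack("<HH", f.read(4))
        if f.read(key_len) == key:
            return list(struct.unpack(f"<{num_values}Q", f.read(8 * num_values)))
        f.seek(8 * num_values, io.SEEK_CUR)  # skip the counters of a non-matching entry
    return None


profile = io.BytesIO(write_table({"main": [10, 3], "foo": [7]}))
print(lookup(profile, "main"))
```

In this sketch a lookup reads a fixed-size header field, one offset, and a single contiguous bucket, regardless of how many functions the file holds. A trie lookup would instead follow one node per step of the key, and those nodes are not necessarily adjacent on disk.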

But the main benefit of the clang on-disk hash table is that it’s in use and it already works.  Unless tries are significantly better, I would rather clean up the (working) hash table implementation than implement (and debug) something new.

> They have the nice property of being inherently stable and unambiguous, and of not baking any hashing algorithm into the format.

It *is* harder to keep the hash table stable.  I think it’s worth the cost here.
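
One way to pay that cost, sketched here purely as an assumption and not as part of the proposal, is to record which hash function a file was written with in the header metadata, so that a reader never silently probes with a different function. The magic bytes, the identifier `HASH_DJB2_32`, and the function names below are hypothetical.

```python
import struct

MAGIC = b"PROF"      # hypothetical magic bytes
HASH_DJB2_32 = 1     # hypothetical identifier for the hash function below


def djb2_32(name: bytes) -> int:
    # The algorithm behind an identifier must never change once files exist.
    h = 5381
    for b in name:
        h = (h * 33 + b) & 0xFFFFFFFF
    return h


# Readers keep every hash function that was ever written to disk.
HASH_FUNCTIONS = {HASH_DJB2_32: djb2_32}


def write_header(hash_id: int) -> bytes:
    return MAGIC + struct.pack("<I", hash_id)


def select_hash(header: bytes):
    # Reject files whose hash function this reader does not know.
    if header[:4] != MAGIC:
        raise ValueError("not a profile file")
    (hash_id,) = struct.unpack("<I", header[4:8])
    try:
        return HASH_FUNCTIONS[hash_id]
    except KeyError:
        raise ValueError(f"unknown hash function id {hash_id}") from None


hash_fn = select_hash(write_header(HASH_DJB2_32))
print(hash_fn(b"main"))
```

With this arrangement, switching to a different hash function means adding a new identifier rather than redefining an old one, so files that already exist stay readable.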