[PATCH] D153692: Fixed D147740 - [llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map

Mon Jun 26 17:09:50 PDT 2023

wenlei added a comment.

In D153692#4450504 <https://reviews.llvm.org/D153692#4450504>, @huangjd wrote:

>> I don't think there's much we can do still, even if there's collision report. Maybe we can revert back to use full name instead of MD5, if that turns out to be a problem. But as you mentioned we already support MD5 in profile generation.
>
> If a collision is rare while there is a strong demand for correctness by users, we can add some ad-hoc logic to deal with that, which is still faster than using a multimap, and much faster than using the full function name. If collision happens frequently, then we have no choice but to revert back to use full name, but this is extremely unlikely. I couldn't find any research on partial MD5 (we only use 64-bit of it) collision using ASCII strings, so that's why I am not ruling out this out of caution.

I'm still not convinced that we need to detect or report collision, It's not a correctness issue when collision happens - one of the function will just lose its profile. It's going to be very rare for function name to hit collision (we're not talking about hashing the entire function which is mapping a much bigger universe into an int), and it's going to be even less likely for a collision to cause noticeable perf loss. That said, if there's concern around collision, maybe we should verify this assumption by building large code base and see whether/how often it happens as a one-off testing for this change, instead of building collision detection into the final implementation (I doubt we would actually get any report even if it happens, if the report is debug only).

>> Can you elaborate? Is that critical or just for convenience?
>
> It's to enforce OOP practice. The optimization passes see SampleProfileMap, which is supposed to map Functions to SampleProfiles. It should not care how keys are actually represented, and it's much more cleaner to write profiles.find(FuncName) than profiles.find(MD5Hash(FuncName)). The latter has another potential risk of misuse, because there already exist multiple ways to get a string's hash value: MD5Hash, llvm::hash_value, std::hash, and they are all different, we have to make sure the correct hash function is used at all times.

Ok, I think this is fair - hiding hashing details behind the scene is reasonable. But if that's the intention, SampleProfileMap can be a very thin wrapper on top of existing container.

>> FWIW, for CallTargetMap, we experimented with changing from StringMap (which is bad since it owns copies of strings) to DenseMap<StringRef, uint64_t>, and it regressed memory usage noticeably. The problem is DenseMap grows to 64 entries on first insertion, but CallTargetMap is usually very sparse with zero or a few entries. FunctionSamplesMap is likely sparse too.
>
> Yes, I am aware of that. I am planning to change it to HashKeyMap<std::map, StringRef, uint64_t> as well (or unordered_map, depends on which one is more beneficial. After that I also have a plan to use specialized data structure for CallTargetMap and FunctionSamplesMap because it's true that they have zero or one entries for almost all the cases.

Do you plan to change CallTargetMap to use MD5 as key as well? This will be different from SampleProfileMap in the sense that SampleProfileMap has function name as part of its value (function samples), but for CallTargetMap, its value is just a count, so if we change its key to MD5, we won't be able to recover "original" key from its value for debugging etc.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153692/new/

https://reviews.llvm.org/D153692