[llvm] [IRPGO][ValueProfile] Instrument virtual table address that could be used to do virtual table address comparison for indirect-call-promotion. (PR #66825)

via llvm-commits llvm-commits at lists.llvm.org
Wed Oct 11 22:59:04 PDT 2023


WenleiHe wrote:

> Currently we are exploring the design space to accommodate the variety of platforms and FDO types we use internally. This is a priority for us though so we should have some updates to share externally by the end of the year.

Looking forward to it. We will see if it's compatible with what we have internally too. Hopefully they all converge naturally. 

> It's interesting that you are exploring a profile-guided direction for this. Chatelet et al. (from Google) published a paper on automatic generation of memcpy, which uses PMU-based parameter profiling, at ISMM'21. The technique does not use Intel DLA; instead, we use precise sampling on call instructions in the process and filter the functions of interest. We inspect the RDX register to collect the parameter value for size. The data in aggregate was used to auto-generate the memcpy implementation. 

Did you end up using that for optimizing your workload? 1% from that is quite a lot. I don't think we have a similar opportunity space for our workload, given that overall mem* cycles % is low, and there are also long copies where such an optimization won't help much (most cycles are spent doing the actual copy rather than traversing the memcpy decision tree).

> Section 2.4 has the rationale for not using an FDO approach. @gchatelet is the primary owner for this work.

We use FDO to allow more inline expansion of mem*, generating an optimized memcpy decision tree for a given size histogram/value range. This bypasses the mem* libcall entirely, so there is no need to modify code in asm or in the library. 
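As a hypothetical sketch (not the actual compiler output), the inline-expanded decision tree for a call site whose profiled sizes cluster at 8 and 16 bytes could look like the following; the specific sizes and thresholds are invented for illustration:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical FDO expansion of a hot memcpy(dst, src, n) call site.
// The branches follow the profiled size distribution: the most frequent
// sizes get fixed-size copies the backend can lower to a few moves, and
// only the cold tail falls back to the libcall.
void copy_specialized(void *dst, const void *src, size_t n) {
  if (n == 8) {
    std::memcpy(dst, src, 8);    // fixed size: one 8-byte load/store pair
  } else if (n == 16) {
    std::memcpy(dst, src, 16);   // fixed size: one 16-byte vector move
  } else if (n <= 64) {
    std::memcpy(dst, src, n);    // hot range: small bounded copy
  } else {
    std::memcpy(dst, src, n);    // cold tail: ordinary libcall
  }
}
```

The fixed-size branches are the point: once the size is a compile-time constant, the backend emits straight-line moves instead of a call.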

> Instrumentation FDO currently has memcpy size value profiling and specialization, but it is quite limited because 1) it only specializes on single values, not on ranges; and 2) due to the lack of context sensitivity, there are not many sites that can be specialized.

> With context sensitive range based specialization (for optimal dispatch), I expect the performance effect to be larger. However, the hardware acceleration (e.g. rep mov) may also get better over time, eating away software optimization benefit.

The current memsizeopt from IRPGO is mostly flat for one of our workloads. Our prototype for sample PGO handled ranges instead of just single constant values, and we have some context sensitivity.
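For illustration only (this is not LLVM's actual memop value-profiling scheme), range-based size profiling can keep exact buckets for small sizes and collapse larger sizes into power-of-two ranges, so the specializer can dispatch on a hot range rather than a single constant:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical range bucketing for profiled memcpy sizes: exact buckets
// for 0..8 bytes, then power-of-two ranges (9-16 -> 16, 17-32 -> 32, ...).
uint64_t bucket_of(size_t n) {
  if (n <= 8) return n;        // small sizes stay exact
  uint64_t b = 16;
  while (b < n) b <<= 1;       // round up to the next power-of-two range
  return b;
}

// Aggregate observed sizes into a range histogram that a specializer
// could consume when deciding which branches of the dispatch tree to emit.
std::map<uint64_t, uint64_t> histogram(const size_t *sizes, size_t count) {
  std::map<uint64_t, uint64_t> hist;
  for (size_t i = 0; i < count; ++i)
    ++hist[bucket_of(sizes[i])];
  return hist;
}
```

Range buckets trade a little precision for far fewer counters, and a hot range (e.g. "9..16 bytes") is still enough to justify a specialized branch even when no single size dominates.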

https://github.com/llvm/llvm-project/pull/66825


More information about the llvm-commits mailing list