[llvm] [IRPGO][ValueProfile] Instrument virtual table address that could be used to do virtual table address comparison for indirect-call-promotion. (PR #66825)

Mingming Liu via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 12 11:35:08 PDT 2023


minglotus-6 wrote:

+1 that an LLVM Discourse thread is a more formal place to continue the discussion of the AutoFDO profile format.

To clarify a few questions, I'm replying inline with input from Snehasish.

> > Currently we are exploring the design space to accommodate the variety of platforms and FDO types we use internally. This is a priority for us though so we should have some updates to share externally by the end of the year.
> 
> Looking forward to it. We will see if it's compatible with what we have internally too. Hopefully they all converge naturally.
> 
> > It's interesting that you are exploring a profile-guided direction for this. Chatelet et al. (from Google) published a paper at ISMM'21 on automatic generation of memcpy which uses PMU-based parameter profiling. The technique does not use Intel DLA; instead we use precise sampling on call instructions in the process and filter the functions of interest. We inspect the RDX register to collect the parameter value for size. The data in aggregate was used to auto-generate the memcpy implementation.
> 
> Did you end up using that for optimizing your workload? 

The mem* (memcpy and the other mem* functions) optimization has been rolled out in the fleet for years.

> 1% from that is quite a lot. I don't think we have similar opportunity space for our workload given overall mem* cycles % is low, and there are also long copies where such optimization won't help much (spend most cycles doing actual copies rather than traversing the decision tree of memcpy)

From section 4.3 of the paper: _using this version of memcpy and associated memcmp and memset improved the throughput of one of our main services by +0.65% ± 0.1% Requests Per Second (RPS) compared to the default shared glibc. Overall we estimate that this work improves the performance of the fleet by 1%_. Wins also come from reducing PLT overhead. Meanwhile, as David mentions above, hardware (e.g., `rep movsb`) keeps improving with newer generations of machines.
 
> 
> > Section 2.4 has the rationale for not using an FDO approach. @gchatelet is the primary owner for this work.
> 
> We use FDO and allow more inline expansion of mem* to generate optimized memcpy decision tree for given size histogram/value range. So we bypass mem* libcall, hence there's no need to modify code in asm or from lib.
> 
> > Instrumentation FDO currently has memcpy size value profiling and specialization, but it is quite limited due to 1) it only specializes on single values, not on ranges; 2) due to the lack of context sensitivity, there are not many sites that can be specialized.
> 
> > With context sensitive range based specialization (for optimal dispatch), I expect the performance effect to be larger. However, the hardware acceleration (e.g. rep mov) may also get better over time, eating away software optimization benefit.
> 
> The current memsizeopt from IRPGO is mostly flat for one of our workloads. Our prototype for sample PGO did handle ranges instead of just single constant values, and we have some context sensitivity.
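For readers following the thread, here is a rough, hand-written C++ sketch of what an inline-expanded mem* decision tree can look like once the compiler knows the size histogram at a call site. This is my own illustration, not the actual codegen of either implementation; the call-site shape and the "hot" ranges below are hypothetical.

```
#include <cstddef>
#include <cstring>

// Illustrative sketch only, not the actual IRPGO or sample-PGO codegen:
// roughly what an inline-expanded mem* decision tree can look like once
// the size histogram at a call site is known. The ranges below are
// hypothetical hot buckets for this call site.
void *specialized_memcpy(void *dst, const void *src, size_t size) {
  unsigned char *d = static_cast<unsigned char *>(dst);
  const unsigned char *s = static_cast<const unsigned char *>(src);
  if (size <= 8) {
    // Hottest range in the hypothetical profile: a short byte loop that
    // stays cheap for sizes up to 8.
    for (size_t i = 0; i < size; ++i)
      d[i] = s[i];
    return dst;
  }
  if (size <= 16) {
    // [9, 16]: two possibly overlapping 8-byte copies, no loop.
    std::memcpy(d, s, 8);
    std::memcpy(d + size - 8, s + size - 8, 8);
    return dst;
  }
  if (size <= 32) {
    // [17, 32]: two possibly overlapping 16-byte copies (one vector pair).
    std::memcpy(d, s, 16);
    std::memcpy(d + size - 16, s + size - 16, 16);
    return dst;
  }
  // Cold tail according to the profile: keep the library call.
  return std::memcpy(dst, src, size);
}
```

The fixed-size memcpy calls in the hot branches get lowered to straight-line scalar or vector moves, so those sizes skip both the libcall and the generic size dispatch inside it.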


> > @htyu Pretty much a drive-by question, if it's convenient for you to share more: how are ranges selected out of sampled values? For example, are the ranges the same for all workloads, or dynamically generated based on the distribution of sizes from the per-workload profile data?
> 
> We group sampled values by ranges pretty much based on the current `folly::memcpy` implementation:
> 
> ```
> 0 
> 1 
> [2,3] 
> [4,7] 
> [8,16] 
> [17,32] 
> [33,64] 
> [65,128] 
> [129,inf] 
> ```
> 
> Values in a range share the same memcpy code, i.e., a pair of forward and backward copies.
> 
> The range layout is fixed and hardcoded in LLVM. The compiler can choose to prioritize a specific range based on the value profile it sees. The profile is just like a LBR profile which is collected per service.
> 
> So at compile time we may have such transformation:
> 
> ```
> memcpy(src, dst, size)
> ```
> 
> =>
> 
> ```
>    if (33 <= size <= 64)
>       vmovups  (%rsi), %ymm0
>       vmovups  -32(%rsi,%rdx), %ymm1
>       vmovups  %ymm0, (%rdi)
>       vmovups  %ymm1, -32(%rdi,%rdx)
>    else
>      call memcpy(src, dst, size)
> ```

Thanks for the illustration. Reference [14] of the paper (from @gchatelet) points to the open-source code that implements [range-based specializations](https://github.com/llvm/llvm-project/commit/04a309dd0be3aea17ab6e84f8bfc046c1f044be2#diff-b9009e1e52f55c9d8a82abb79b0be391c241601c6752e52debe22f56b6e61d6bR1).

I was asking about range selection since it matters for another use case I worked on in the past, where the latency and the additional padding needed to form ranges matter at runtime. That is not the case for memcpy.
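To make the fixed-range grouping above concrete, here is a minimal sketch (my own illustration, not the actual profiling code on either side) of bucketing sampled size values into those ranges to build a per-call-site histogram.

```
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative only: bucket sampled memcpy sizes into the fixed ranges
// quoted above (0, 1, [2,3], [4,7], [8,16], [17,32], [33,64], [65,128],
// [129,inf)) to build a per-call-site histogram.
constexpr std::array<uint64_t, 8> kRangeUpperBounds = {0, 1, 3, 7, 16, 32, 64, 128};

size_t RangeIndex(uint64_t size) {
  for (size_t i = 0; i < kRangeUpperBounds.size(); ++i)
    if (size <= kRangeUpperBounds[i])
      return i;
  return kRangeUpperBounds.size();  // the open-ended [129, inf) bucket
}

// Usage: aggregate sampled (call site, size) pairs into nine counters.
//   std::array<uint64_t, 9> histogram{};
//   histogram[RangeIndex(sampled_size)] += sample_count;
```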

https://github.com/llvm/llvm-project/pull/66825

