[PATCH] D120230: [SelectOpti][1/4] Setup new select-optimize pass

David Li via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 14 22:49:02 PDT 2022


davidxl added a comment.

In D120230#3381568 <https://reviews.llvm.org/D120230#3381568>, @Amir wrote:

> In D120230#3381552 <https://reviews.llvm.org/D120230#3381552>, @davidxl wrote:
>
>>> On that internal workload, we saw 6% fewer cmovs with this pass turned on for IRPGO (it works, no correctness issue :-) ). But perf-wise it's neutral (our setup can measure perf movement as small as 0.2% on that workload with high confidence).
>>
>> Does BOLT's cmov optimization improve performance for this workload?
>
> I didn't measure it yet, but unlikely (see comment below).
>
>>> Say we end up with a cmov in one of the sample PGO iterations (either due to a lack of profile, or a profile indicating the branch is unbiased). We would then lose the control-flow profile needed to tell how biased the original branch is, because we've turned that control flow into data flow. Unless we never use cmov for branches without profile info, we could keep generating cmovs in future iterations even if the branch becomes more biased later, because we will never get a control-flow profile again.
>>>
>>> If we indeed never use cmov for branches without a profile, that turns this problem into a typical sample PGO oscillation. That was not the case before this patch set; are we changing the behavior now? I'm also not sure whether such oscillation is as easily mitigable as other oscillations, like those from speculative ICP.
>>
>> Regarding BOLT's usage for this problem -- does it mean the profile data is not collected from the production binary, but rather from a pre-BOLT binary in a training run?
>
> Yes, the profile data should be collected from pre-BOLT binary.
>
>> If this is the setup, the compiler can choose to minimize cmov generation for the sake of better profiling.
>
> The compiler can indeed choose to minimize cmov generation – I've recently added an LLVM knob to force-expand all cmovs in D119777 <https://reviews.llvm.org/D119777> (x86-cmov-converter-force-all).
>
> However, the data collected with (non-PGO, non-LTO) clang binary suggests that x86-cmov-converter-force-all introduces a significant perf regression that BOLT's CMOV conversion with default heuristics is unable to recover from.
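For concreteness, the hammock shape the quoted thread is discussing looks like the following. This is an illustrative sketch only, not code from this patch:

  // Illustrative sketch -- not code from the patch. A select-eligible
  // "hammock": as a branch, sample PGO observes two basic blocks and can
  // measure how biased the condition is; once it is lowered as a
  // cmov/select, the control flow collapses into data flow and that bias
  // becomes invisible to later profiling runs. (The D119777 knob,
  // x86-cmov-converter-force-all, goes the other way and force-expands
  // every cmov back into a branch.)
  int clamp(int X, int Limit) {
    if (X > Limit)  // branch form: profile records taken/not-taken counts
      X = Limit;    // cmov form: X = (X > Limit) ? Limit : X; no edge counts
    return X;
  }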

I assume BOLT's block layout lays out that branchy code properly, right?

> BOLT converts back a minor percentage (~5%) of eligible hammocks based on execution and misprediction heuristics (>5% misprediction rate, >1% biased condition).

Does only ~5% of the hammock-based execution meet the conversion criteria, or can only 5% of the candidates that match the criteria actually be converted back?

> The hypothesis is that force-expanding cmovs results in 1) a code-size increase and 2) more branches => higher pressure on BPU structures; given that BOLT converts only a small portion of the hammocks back, these factors add up to a net regression.
>
> In other words, misprediction rate may not be the most important factor in the hammock-vs-cmov tradeoff for workloads with a large code footprint. I believe that a holistic approach (criticality + misprediction rate + code size) may yield better performance.

Yes, modelling the global effect as well as branch interactions would be a useful thing to do. Note that newly introduced branches can change BPU behavior and thus lead to a different branch-misprediction distribution, too; a rough sketch of such a combined model follows.
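
A sketch of what such a holistic model could look like. Every name and weight below is hypothetical, made up for illustration; only the 5%/1% thresholds echo the BOLT heuristics quoted earlier in the thread, and nothing here is taken from this patch or from BOLT:

  // Hypothetical cost model combining criticality, misprediction rate, and
  // code size into one branch-vs-cmov decision. All names and constants are
  // illustrative; only the 5%/1% thresholds echo the heuristics quoted above.
  struct HammockStats {
    double MispredictRate; // expected fraction of mispredicted executions
    double ConditionBias;  // how lopsided the condition is (0 = 50/50)
    double Criticality;    // relative weight of this hammock on hot paths
    unsigned ExtraBytes;   // code-size growth from expanding the cmov
  };

  bool shouldExpandCmovToBranch(const HammockStats &S) {
    // A branch only makes sense if it would predict well, i.e. the expected
    // misprediction rate is low and the condition is sufficiently biased.
    if (S.MispredictRate > 0.05 || S.ConditionBias < 0.01)
      return false;
    // Weigh the expected win on the critical path against the code-size and
    // BPU-pressure cost of carrying one more branch (made-up weights).
    double Benefit = S.Criticality * (1.0 - S.MispredictRate);
    double Cost = 0.001 * S.ExtraBytes;
    return Benefit > Cost;
  }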

>> David
>>
>>>> The default misprediction rate used by the compiler (currently 25%) is expected to be lower than the threshold that motivates a conversion to a cmov based on mispredict data. So, for example, if a branch mispredicts 50% of the time, we could convert it to a cmov. That cmov will then be compared against a branch assumed to mispredict 25% of the time, making the branch perhaps more desirable than it would have been if we had mispredict data. The rest of the heuristics will not necessarily allow a conversion back to a branch, but the cmov decision will for sure be revertible.
>>>
>>> nit: Saying "misprediction rate" here and in the RFC is a bit confusing, because today we don't have that data in the profile. That threshold is really how biased a branch is, which is only a proxy for branch misses: the branch predictor can still do well (low branch-miss rate) on unbiased branches.
>>>
>>>> In terms of making this decision at the BOLT level, it might have more limited applicability compared to an LLVM IR pass, since it is a bit harder to find which branches are eligible to be converted to cmovs, and employing dataflow-based heuristics such as those possible at the LLVM IR level seems quite tricky.
>>>
>>> Yes, that is a different challenge.
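
The revertibility argument quoted above can be stated in a couple of lines. This is a sketch of the reasoning only, not the actual heuristic implemented by the pass; the 25% figure is the compiler default quoted in the thread, and the function name is invented for illustration:

  // Sketch of the asymmetry described above. A branch that mispredicted,
  // say, 50% of the time gets converted to a cmov; in a later iteration the
  // control-flow profile no longer exists, so the cmov is compared against
  // a branch with the *default* assumed misprediction rate (25%), which
  // makes the branch look better than it actually was.
  constexpr double DefaultMispredictRate = 0.25; // compiler default, per the thread

  bool cmovDecisionIsRevertible(double RateWhenConverted) {
    // The conversion only happened because RateWhenConverted exceeded the
    // profitability threshold, which itself sits above the 25% default;
    // hence the branch always looks at least as attractive afterwards.
    return RateWhenConverted > DefaultMispredictRate;
  }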




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120230/new/

https://reviews.llvm.org/D120230


