[PATCH] D22696: [CodeGenPrep] Skip merging empty case blocks

Wed Oct 5 14:03:05 PDT 2016

On Wed, Oct 5, 2016 at 12:12 PM, Jun Bum Lim <junbuml at codeaurora.org> wrote:
> junbuml added a comment.
>
>> Does it also enable more machine code sinking? If not, perhaps the heuristics to count copy insertion phis can be improved?
>
> The initial purpose of this patch was to handle the cases which cannot be handled in MachineSink by letting ISel insert COPY in better place.  By avoiding creating critical edge which is non-splittable in MachineSink, I think it could also indirectly increase the change of machine sinking. Since more number of copy insertion PHIs means more chances to sink in later pass, integrating the number of PHIs in the heuristics seems to be reasonable to me.

This part is probably fine. My question was (by looking at the actual
case in povray): do all PHIs end up creating move/copy instructions in
PredBB (if block merging is done) or not. If there are cases where
they don't actual create such overhead, then empty block merging
should be done regardless of the frequency.   In other words, my
suggestion is to study the cases where the block merging is falsely
rejected/skipped and find if there are more signals to look at -- and
hopefully come up with a heuristic to minimize false positives.

 However,  I think the minimum frequency check should come first
because we don't want to leave extra branch in a likely executed block
regardless of the number of PHIs.  To integrate  NumCopyInsertionPHIs
and minimum frequency requirement, I thought about the heuristics
something like :
>
> CurFreqRatio = Freq(PredBB) / Freq(BB);
> if (CurFreqRatio  > MinThresholdToSkipMerge && (NumCopyInsertionPHIs * CurFreqRatio) > MinNumPhiInDestToSkipMerge * MinThresholdToSkipMerge) {
>
>   // skip merging current empty case
>
> }
>

See below.

> Please let me know your thought.
>
>> How do you come up with 1000? It seems like specially tuned for that benchmark where the switch top is in a loop while the switch body is not -- the patch basically will never kick in for any switch in normal form without profile.
>
> In case of the benchmark I was targeting, it has a huge switch in a loop and the switch have many empty cases only used as incoming blocks in PHIs.

That is what I said -- with 1000, it needs a huge switch (with >1000
cases) for this to kick in. This means the cost model is very
imprecise.

> So, merging those empty cases cause many COPYs in the header of switch and MachineSink cannot sink them because it cannot split the critical edges across the jump table. Since the switch is huge, the ratio of FreqOfSwitchHeader to FreqOfEmptyCase was pretty large, more than 3000. In this patch I wanted to hit only such clear cases where the frequency of case is certainly lower than the switch header because the cost of extra branch added by skipping merging empty cases is sometime non-trivial especially when the case is likely to be taken as we observed in the povray case.
>

If there are multiple empty case blocks for the switch, they need to
be considered together. They are either all merged or all be skipped.
If we skip one empty block where merge another, machine sinking pass
won't be able t to sink the copy instruction to the block that is not
merged, and skipping merging will only create jump overhead with zero
benefit.

The cost modeling will also need to consider all empty blocks
together. Assume there are 3 empty blocks,  the frequency of PredBB is
FreqP, the frequencies of the three empty blocks are F1, F2, and F3.
Assume copy instruction cost is CM, and cost of direct jump is CJ.
The cost saving of machine sinking by not merging the three blocks is

          (FreqP - (F1+F2+F3)) * CM, and the savings of eliminating
direct jumps by merging is:
          (F1+F2+F3) *CJ

In order to save overall cost by enabling machine sinking, we need
     (FreqP - (F1+F2+F3))* CM > (F1+F2+F3)*CJ

which becomes
        FreqP/(F1+F2+F3) >  (CJ+CM)/CM

Assuming CJ==CM, we have
     FreqP/(F1+F2+F3) > 2

basically the ratio of PredBB frequency and the sum of all empty block
frequency need to larger than 2.   The more the number of empty
blocks, the harder it is to skip merging.

See reply at the beginning -- this cost model still assumes that the
copy inserting PHI will result in copy instruction that can be sinked
by MachineSinking pass (and result in savings) -- but this may not be
true thus may need tuning.

David

>
> https://reviews.llvm.org/D22696
>
>
>