[PATCH] D67281: [AArch64][SimplifyCFG] Add additional cost for instructions in mergeConditionalStoreToAddress

Mon Nov 25 01:00:34 PST 2019

kpdev42 added a comment.

In D67281#1756463 <https://reviews.llvm.org/D67281#1756463>, @lebedev.ri wrote:

> On X86 branch misprediction is ~10..~20 cycles,

Very interesting. I didn't measure misprediction cycles, only execution time.

> Is it dirt cheap on AArch64? (any number?)

Sorry, I didn't get the question :) Are you asking about the number of cycles a branch misprediction costs?

> Perhaps we need to redefine the threshold in terms of branch misprediction cost?

That actually sounds very promising. But at the moment I do not know where to get this cost :) Is it supposed to be in the processor description (e.g. for Cortex-A75 - https://developer.arm.com/docs/100403/0301 )?

Another thought: maybe we could just compare the execution latency / throughput of the merged and non-merged variants and choose the variant with the smallest total?

For example (all latency/throughput data is taken from https://developer.arm.com/docs/101398/0200/arm-cortex-a75-software-optimization-guide-v20 ; the code is taken from https://bugs.llvm.org/show_bug.cgi?id=43205 ):

  Non-merged variant
                              | Execution Latency | Execution Throughput
  ---------------------------------------------------------------------
  tst     x18, x1             | 1                 | 2
  b.eq    .LBB0_10            | 1                 | 1
  .LBB0_9:                    |                   | 
  orr     x16, x16, x18       | 1                 | 2
  add     w0, w0, #1          | 1                 | 2
  str     xzr, [x13, #56]     | 1                 | 1
  .LBB0_10:                   |                   | 
  cbz     x11, .LBB0_7        | 1                 | 1
  str     xzr, [x13,#56]      | 1                 | 1
  ---------------------------------------------------------------------
  Total:                      | 7                 | 10

  Merged variant
                              | Execution Latency | Execution Throughput
  ---------------------------------------------------------------------
  and     x3, x2, x1          | 1                 | 2
  tst     x2, x1              | 1                 | 2
  orr     x5, x11, x3         | 1                 | 2
  cinc    w0, w0, ne          | 1                 | 2
  csel    x3, xzr, x2, eq     | 1                 | 2
  cbz     x5, .LBB0_7         | 1                 | 1
  str     xzr, [x13,#56]      | 1                 | 1
  orr     x16, x16, x3        | 1                 | 2
  ---------------------------------------------------------------------
  Total:                      | 8                 | 14

In the case above, the non-merged variant is better.

Is this a valid approach?
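
To make the comparison concrete, here is a minimal standalone sketch of the bookkeeping behind the two tables. It is not LLVM code: the types `InstrCost` and `VariantCost` and all function names are hypothetical, the numbers are copied by hand from the tables rather than queried from a scheduling model, and the decision rule (compare latency totals only) is an assumption. Like the tables, it sums every listed instruction regardless of which path actually executes.

```cpp
#include <vector>

// Hypothetical per-instruction record; values are copied from the tables
// above, not obtained from any LLVM scheduling model.
struct InstrCost {
  const char *Mnemonic;
  unsigned Latency;
  unsigned Throughput;
};

// Column totals for one variant.
struct VariantCost {
  unsigned Latency = 0;
  unsigned Throughput = 0;
};

// Sum both columns over every listed instruction, as the tables do.
VariantCost sumCosts(const std::vector<InstrCost> &Instrs) {
  VariantCost Total;
  for (const InstrCost &I : Instrs) {
    Total.Latency += I.Latency;
    Total.Throughput += I.Throughput;
  }
  return Total;
}

// Rows of the "Non-merged variant" table (labels carry no cost).
std::vector<InstrCost> nonMergedVariant() {
  return {{"tst", 1, 2}, {"b.eq", 1, 1}, {"orr", 1, 2}, {"add", 1, 2},
          {"str", 1, 1}, {"cbz", 1, 1},  {"str", 1, 1}};
}

// Rows of the "Merged variant" table.
std::vector<InstrCost> mergedVariant() {
  return {{"and", 1, 2},  {"tst", 1, 2}, {"orr", 1, 2}, {"cinc", 1, 2},
          {"csel", 1, 2}, {"cbz", 1, 1}, {"str", 1, 1}, {"orr", 1, 2}};
}

// Assumed decision rule: keep the branches when the non-merged sequence
// has the smaller latency total.
bool nonMergedIsCheaper(const VariantCost &NonMerged,
                        const VariantCost &Merged) {
  return NonMerged.Latency < Merged.Latency;
}
```

Calling `nonMergedIsCheaper(sumCosts(nonMergedVariant()), sumCosts(mergedVariant()))` reproduces the conclusion above for this one example; whether a plain sum is a good enough model is exactly the open question.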

================
Comment at: llvm/lib/Transforms/Utils/SimplifyCFG.cpp:2420-2423
+  // We need to be sure, that DomBlock has
+  // enough room for new instructions
+  // First add cost of Select instruction, that will be added to this block
+  // (this cost is equal to number of phi nodes in BB)
----------------
lebedev.ri wrote:
> So in other words we'd only perform the fold only if the preceding block is not larger than
> what we'd add via folding. In other words if we are okay flatteing 4-instruction 2-entry PHI,
> the dominating BB must contain less than 4 instructions.
> That seems awfully hand-wavy to me, i'm afraid :(
> It will make the fold not happen in all the cases i'm aware of. 
Yes, I agree, this comparison is hand-wavy as it stands :)
```
 if (Cost > PHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic) {
``` 
So we need a separate threshold here. I will change it today.
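
A rough standalone sketch of what a separate threshold check might look like, written so it compiles without LLVM. The name `HypotheticalMergeCondStoresThreshold` and its value are made up for illustration and are not part of the patch; `TCC_Basic` is redefined locally with the value 1 that `TargetTransformInfo::TCC_Basic` has.

```cpp
// Local stand-in for TargetTransformInfo::TCC_Basic (value 1), so the
// sketch builds without LLVM headers.
constexpr unsigned TCC_Basic = 1;

// Hypothetical knob dedicated to this fold; the name and the value 4 are
// placeholders, not what the patch will use.
constexpr unsigned HypotheticalMergeCondStoresThreshold = 4;

// Same shape as the check quoted above, but parameterised on the threshold
// so the fold no longer has to share PHINodeFoldingThreshold.
bool exceedsBudget(unsigned Cost, unsigned Threshold) {
  return Cost > Threshold * TCC_Basic;
}
```

With this shape the fold would be rejected when `exceedsBudget(Cost, HypotheticalMergeCondStoresThreshold)` is true, independently of the PHI folding limit.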

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D67281/new/

https://reviews.llvm.org/D67281