[PATCH] D67281: [AArch64][SimplifyCFG] Add additional cost for instructions in mergeConditionalStoreToAddress
Pavel Kosov via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Nov 25 01:00:34 PST 2019
kpdev42 added a comment.
In D67281#1756463 <https://reviews.llvm.org/D67281#1756463>, @lebedev.ri wrote:
> On X86 branch misprediction is ~10..~20 cycles,
Very interesting. I didn't measure misprediction cycles, only execution time.
> Is it dirt cheap on AArch64? (any number?)
Sorry, I didn't get the question :) Are you asking how many cycles a branch misprediction costs?
> Perhaps we need to redefine the threshold in terms of branch misprediction cost?
It actually sounds very promising. But at the moment I do not know where to get this cost :) Is it supposed to be in the processor description (e.g. for Cortex-A75 - https://developer.arm.com/docs/100403/0301 )?
And yet another thought: maybe we could just compare the execution latency / throughput of the merged and non-merged variants and choose the variant with the smallest total value?
For example (all latency/throughput data is taken from https://developer.arm.com/docs/101398/0200/arm-cortex-a75-software-optimization-guide-v20 ; the code is taken from https://bugs.llvm.org/show_bug.cgi?id=43205 ):
Non-merged variant
                      | Execution Latency | Execution Throughput
---------------------------------------------------------------------
tst x18, x1           | 1                 | 2
b.eq .LBB0_10         | 1                 | 1
.LBB0_9:              |                   |
orr x16, x16, x18     | 1                 | 2
add w0, w0, #1        | 1                 | 2
str xzr, [x13, #56]   | 1                 | 1
.LBB0_10:             |                   |
cbz x11, .LBB0_7      | 1                 | 1
str xzr, [x13,#56]    | 1                 | 1
---------------------------------------------------------------------
Total:                | 7                 | 10
Merged variant
                      | Execution Latency | Execution Throughput
---------------------------------------------------------------------
and x3, x2, x1        | 1                 | 2
tst x2, x1            | 1                 | 2
orr x5, x11, x3       | 1                 | 2
cinc w0, w0, ne       | 1                 | 2
csel x3, xzr, x2, eq  | 1                 | 2
cbz x5, .LBB0_7       | 1                 | 1
str xzr, [x13,#56]    | 1                 | 1
orr x16, x16, x3      | 1                 | 2
---------------------------------------------------------------------
Total:                | 8                 | 14
In the case above, the non-merged variant is better.
Is this a valid approach?
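To make the idea concrete, here is a minimal standalone sketch of the comparison, with the numbers from the two tables hard-coded. All names in it (InstCost, Totals, sumCosts, shouldMerge) are hypothetical and are not LLVM API. The decision rule is also an assumption: it treats smaller totals as better, as the tables do, and merges only when the merged variant is no worse in both columns.

```cpp
#include <vector>

// Hypothetical record for one instruction; not an LLVM type.
struct InstCost {
  const char *Asm;
  unsigned Latency;
  unsigned Throughput;
};

// Column totals for one variant.
struct Totals {
  unsigned Latency = 0;
  unsigned Throughput = 0;
};

// Sum both columns over all instructions of a variant.
Totals sumCosts(const std::vector<InstCost> &Insts) {
  Totals T;
  for (const InstCost &I : Insts) {
    T.Latency += I.Latency;
    T.Throughput += I.Throughput;
  }
  return T;
}

// Numbers copied from the "Non-merged variant" table.
std::vector<InstCost> nonMergedVariant() {
  return {
      {"tst x18, x1", 1, 2},
      {"b.eq .LBB0_10", 1, 1},
      {"orr x16, x16, x18", 1, 2},
      {"add w0, w0, #1", 1, 2},
      {"str xzr, [x13, #56]", 1, 1},
      {"cbz x11, .LBB0_7", 1, 1},
      {"str xzr, [x13, #56]", 1, 1},
  };
}

// Numbers copied from the "Merged variant" table.
std::vector<InstCost> mergedVariant() {
  return {
      {"and x3, x2, x1", 1, 2},
      {"tst x2, x1", 1, 2},
      {"orr x5, x11, x3", 1, 2},
      {"cinc w0, w0, ne", 1, 2},
      {"csel x3, xzr, x2, eq", 1, 2},
      {"cbz x5, .LBB0_7", 1, 1},
      {"str xzr, [x13, #56]", 1, 1},
      {"orr x16, x16, x3", 1, 2},
  };
}

// Assumed rule: merge only if the merged variant is no worse in both totals.
bool shouldMerge(const Totals &NonMerged, const Totals &Merged) {
  return Merged.Latency <= NonMerged.Latency &&
         Merged.Throughput <= NonMerged.Throughput;
}
```

Calling shouldMerge(sumCosts(nonMergedVariant()), sumCosts(mergedVariant())) compares the totals 7/10 against 8/14, so with this rule the merge would be rejected for this example.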
================
Comment at: llvm/lib/Transforms/Utils/SimplifyCFG.cpp:2420-2423
+ // We need to be sure, that DomBlock has
+ // enough room for new instructions
+ // First add cost of Select instruction, that will be added to this block
+ // (this cost is equal to number of phi nodes in BB)
----------------
lebedev.ri wrote:
> So in other words we'd only perform the fold only if the preceding block is not larger than
> what we'd add via folding. In other words if we are okay flatteing 4-instruction 2-entry PHI,
> the dominating BB must contain less than 4 instructions.
> That seems awfully hand-wavy to me, i'm afraid :(
> It will make the fold not happen in all the cases i'm aware of.
Yes, agree, using this comparison is a hand-wave as it is :)
```
if (Cost > PHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic) {
```
So we need a separate threshold here. I will change it today.
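Roughly what I mean by a separate threshold, as a standalone sketch rather than the actual patch. The constants below are local stand-ins: TCC_Basic mirrors TargetTransformInfo::TCC_Basic, PHINodeFoldingThreshold mirrors the existing option, and MergeCondStoresThreshold with its value of 4 is purely hypothetical.

```cpp
// Local stand-in for TargetTransformInfo::TCC_Basic (assumed to be 1 here).
constexpr unsigned TCC_Basic = 1;

// Stand-in for the existing PHI folding threshold (value assumed for the sketch).
constexpr unsigned PHINodeFoldingThreshold = 2;

// Hypothetical new threshold used only for merging conditional stores.
constexpr unsigned MergeCondStoresThreshold = 4;

// Same shape as the check quoted above, with the threshold passed in
// so each transform can use its own budget.
constexpr bool exceedsBudget(unsigned Cost, unsigned Threshold) {
  return Cost > Threshold * TCC_Basic;
}
```

With these assumed values, a cost of 3 exceeds the PHI folding budget but still fits the hypothetical store-merging budget, so the two decisions no longer depend on the same knob.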
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D67281/new/
https://reviews.llvm.org/D67281