[PATCH] D46662: [X86] condition branches folding for three-way conditional codes

Andrea Di Biagio via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 26 06:54:41 PDT 2018


andreadb added a comment.

Hi Rong,

On Jaguar, this pass may increase branch density to the point where it hurts the performance of the branch predictor.

Branch prediction in Jaguar is affected by branch density.
To give you a bit of background: Jaguar's BTB is logically partitioned into two levels. The first level (L1) is specialized for sparse branches; the second level (L2) is specialized for dense branches and is dynamically allocated (when there are more than 2 branches per cache line).
The L2 is a bit slower (being dynamically allocated) and tends to have a lower throughput than the L1. So, ideally, the L1 should be used as much as possible.

This patch increases branch density to the point where the L2 BTB usage increases, and the efficiency of the branch predictor decreases.

Bench: 4evencases.cc
--------------------

Without your patch (10 runs):

  Each iteration uses 902058 nano seconds
  Case counts: 0 261000000 250000000 246000000 243000000
  Each iteration uses 887837 nano seconds
  Case counts: 0 281000000 253000000 227000000 239000000
  Each iteration uses 887856 nano seconds
  Case counts: 0 256000000 254000000 236000000 254000000
  Each iteration uses 880632 nano seconds
  Case counts: 0 279000000 236000000 244000000 241000000
  Each iteration uses 1.03057e+06 nano seconds
  Case counts: 0 258000000 257000000 243000000 242000000
  Each iteration uses 883759 nano seconds
  Case counts: 0 248000000 262000000 278000000 212000000
  Each iteration uses 910438 nano seconds
  Case counts: 0 248000000 254000000 243000000 255000000
  Each iteration uses 885671 nano seconds
  Case counts: 0 258000000 266000000 231000000 245000000
  Each iteration uses 912325 nano seconds
  Case counts: 0 225000000 264000000 270000000 241000000
  Each iteration uses 904952 nano seconds
  Case counts: 0 261000000 240000000 241000000 258000000

With your patch (10 runs):

  Each iteration uses 916110 nano seconds
  Case counts: 0 223000000 266000000 263000000 248000000
  Each iteration uses 918773 nano seconds
  Case counts: 0 266000000 230000000 236000000 268000000
  Each iteration uses 903100 nano seconds
  Case counts: 0 250000000 249000000 231000000 270000000
  Each iteration uses 923196 nano seconds
  Case counts: 0 241000000 243000000 276000000 240000000
  Each iteration uses 911282 nano seconds
  Case counts: 0 241000000 239000000 266000000 254000000
  Each iteration uses 910201 nano seconds
  Case counts: 0 210000000 263000000 260000000 267000000
  Each iteration uses 925672 nano seconds
  Case counts: 0 245000000 265000000 236000000 254000000
  Each iteration uses 932643 nano seconds
  Case counts: 0 235000000 259000000 256000000 250000000
  Each iteration uses 937735 nano seconds
  Case counts: 0 261000000 242000000 259000000 238000000
  Each iteration uses 954895 nano seconds
  Case counts: 0 254000000 239000000 271000000 236000000

Overall, 4evencases.cc is ~2% slower with this patch.

Bench: 15evencases.cc
---------------------

Without your patch (10 runs):

  Each iteration uses 1.10148e+06 nano seconds
  Case counts: 0 56000000 60000000 68000000 61000000 69000000 64000000 80000000 64000000 68000000 66000000 83000000 74000000 50000000 73000000 64000000
  Each iteration uses 1.0648e+06 nano seconds
  Case counts: 0 71000000 59000000 55000000 64000000 73000000 57000000 55000000 74000000 76000000 67000000 77000000 57000000 82000000 54000000 79000000
  Each iteration uses 1.06872e+06 nano seconds
  Case counts: 0 55000000 80000000 59000000 45000000 70000000 61000000 68000000 72000000 77000000 67000000 88000000 63000000 61000000 77000000 57000000
  Each iteration uses 1.04146e+06 nano seconds
  Case counts: 0 68000000 61000000 67000000 50000000 70000000 68000000 73000000 69000000 61000000 78000000 69000000 64000000 67000000 75000000 60000000
  Each iteration uses 1.0549e+06 nano seconds
  Case counts: 0 66000000 75000000 64000000 64000000 74000000 78000000 63000000 64000000 67000000 57000000 65000000 63000000 74000000 66000000 60000000
  Each iteration uses 1.04246e+06 nano seconds
  Case counts: 0 66000000 69000000 63000000 76000000 66000000 78000000 44000000 66000000 61000000 75000000 66000000 70000000 67000000 64000000 69000000
  Each iteration uses 1.07907e+06 nano seconds
  Case counts: 0 63000000 66000000 81000000 68000000 56000000 71000000 71000000 68000000 58000000 65000000 64000000 75000000 63000000 71000000 60000000
  Each iteration uses 1.05432e+06 nano seconds
  Case counts: 0 66000000 67000000 70000000 65000000 57000000 53000000 62000000 62000000 63000000 74000000 68000000 81000000 70000000 77000000 65000000
  Each iteration uses 1.04041e+06 nano seconds
  Case counts: 0 71000000 71000000 65000000 69000000 77000000 67000000 52000000 60000000 73000000 80000000 76000000 66000000 55000000 49000000 69000000
  Each iteration uses 1.07782e+06 nano seconds
  Case counts: 0 68000000 76000000 63000000 79000000 76000000 71000000 65000000 61000000 63000000 63000000 61000000 56000000 67000000 61000000 70000000

With your patch (10 runs):

  Each iteration uses 1.11151e+06 nano seconds
  Case counts: 0 64000000 64000000 73000000 72000000 69000000 75000000 66000000 70000000 77000000 59000000 50000000 74000000 68000000 58000000 61000000
  Each iteration uses 1.28406e+06 nano seconds
  Case counts: 0 68000000 63000000 66000000 69000000 68000000 58000000 71000000 60000000 80000000 66000000 80000000 69000000 57000000 62000000 63000000
  Each iteration uses 1.18149e+06 nano seconds
  Case counts: 0 67000000 68000000 66000000 69000000 71000000 67000000 64000000 69000000 72000000 61000000 73000000 60000000 66000000 71000000 56000000
  Each iteration uses 1.30169e+06 nano seconds
  Case counts: 0 74000000 66000000 69000000 64000000 70000000 64000000 59000000 61000000 53000000 75000000 74000000 58000000 72000000 68000000 73000000
  Each iteration uses 1.15588e+06 nano seconds
  Case counts: 0 62000000 66000000 67000000 62000000 79000000 65000000 59000000 54000000 65000000 61000000 62000000 82000000 74000000 68000000 74000000
  Each iteration uses 1.16992e+06 nano seconds
  Case counts: 0 69000000 64000000 71000000 60000000 60000000 70000000 64000000 77000000 65000000 75000000 61000000 70000000 61000000 77000000 56000000
  Each iteration uses 1.2683e+06 nano seconds
  Case counts: 0 66000000 69000000 73000000 76000000 72000000 59000000 64000000 61000000 53000000 78000000 66000000 63000000 66000000 57000000 77000000
  Each iteration uses 1.17196e+06 nano seconds
  Case counts: 0 67000000 69000000 84000000 52000000 56000000 70000000 58000000 64000000 71000000 72000000 67000000 68000000 68000000 73000000 61000000
  Each iteration uses 1.28627e+06 nano seconds
  Case counts: 0 70000000 70000000 70000000 57000000 73000000 71000000 70000000 57000000 57000000 67000000 69000000 61000000 60000000 76000000 72000000
  Each iteration uses 1.28318e+06 nano seconds
  Case counts: 0 61000000 72000000 70000000 80000000 68000000 59000000 59000000 65000000 49000000 78000000 65000000 64000000 64000000 77000000 69000000

Here the performance varies a lot depending on whether we are in the dense branch portion or not. Note also that prediction through the L2 BTB has a lower throughput (in branches predicted per cycle).

Excluding outliers, the average performance degradation is ~8-10%.

While this analysis has only been conducted on Jaguar, I suspect that similar problems would also affect AMD Bobcat, since its branch predictor is similar to Jaguar's.

I wouldn't be surprised if this patch instead improved performance on bigger AMD cores like Bulldozer/Ryzen.

However, at least for now, I suggest making this pass optional (i.e. opt-in for subtargets).
It should definitely be disabled for Jaguar (BtVer2) and Bobcat.
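For what it's worth, one way to make the pass opt-in would be a subtarget feature, sketched here following LLVM's TableGen conventions (the feature and field names below are hypothetical, not taken from the actual patch):

```tablegen
// Hypothetical subtarget feature gating the branch-folding pass.
// Subtargets that benefit (e.g. bigger cores) would add this to their
// feature lists; BtVer2/Bobcat would simply leave it off.
def FeatureThreeWayBranchFolding
    : SubtargetFeature<"threeway-branch-folding", "ThreeWayBranchFolding",
                       "true",
                       "Fold conditional branches into three-way "
                       "conditional branches">;
```

The pass entry point would then check the corresponding subtarget predicate before doing any folding.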

-Andrea


https://reviews.llvm.org/D46662
