[PATCH] D46662: [X86] condition branches folding for three-way conditional codes

Rong Xu via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 26 09:39:03 PDT 2018


Hi Andrea,

Thanks for running these tests, and for the explanation. Can you run the
tests on Bulldozer/Ryzen? I don't have access to those platforms. If I need
to do this in a subtarget-specific way, it would be good to know the
performance there.

Regards,

-Rong

On Wed, Sep 26, 2018 at 6:54 AM Andrea Di Biagio via Phabricator <
reviews at reviews.llvm.org> wrote:

> andreadb added a comment.
>
> Hi Rong,
>
> On Jaguar, this pass may increase branch density to the point where it
> hurts the performance of the branch predictor.
>
> Branch prediction in Jaguar is affected by branch density.
> To give you a bit of background: Jaguar's BTB is logically partitioned
> into two levels: a first level, which is specialized for sparse branches,
> and a second level, which is specialized for dense branches and is
> dynamically allocated (when there are more than 2 branches per cache
> line).
> The L2 is a bit slower (being dynamically allocated) and tends to have a
> lower throughput than the L1. So, ideally, the L1 should be used as much
> as possible.
>
> This patch increases branch density to the point where more branches
> spill into the L2 BTB, and the efficiency of the branch predictor
> decreases.
>
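> To make the pattern concrete, here is a hypothetical reduction (not code
> from the patch) of the kind of three-way conditional involved; as far as
> I understand the transformation, the folded form ends up with one compare
> feeding two adjacent conditional branches, which is what drives up the
> branch density:
>
>   // Hypothetical three-way conditional; before the folding this lowers
>   // to two separate cmp+jcc pairs (usually in different blocks), after
>   // it to one cmp followed by two adjacent jcc's, i.e. more branches
>   // packed into the same cache line.
>   int threeWay(int x, int val) {
>     if (x < val)
>       return -1; // cmp; jl
>     if (x > val)
>       return 1;  // jg -- the folded form reuses the first cmp's flags
>     return 0;
>   }
>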
> Bench: 4evencases.cc
> --------------------
>
> Without your patch (10 runs):
>
>   Each iteration uses 902058 nano seconds
>   Case counts: 0 261000000 250000000 246000000 243000000
>   Each iteration uses 887837 nano seconds
>   Case counts: 0 281000000 253000000 227000000 239000000
>   Each iteration uses 887856 nano seconds
>   Case counts: 0 256000000 254000000 236000000 254000000
>   Each iteration uses 880632 nano seconds
>   Case counts: 0 279000000 236000000 244000000 241000000
>   Each iteration uses 1.03057e+06 nano seconds
>   Case counts: 0 258000000 257000000 243000000 242000000
>   Each iteration uses 883759 nano seconds
>   Case counts: 0 248000000 262000000 278000000 212000000
>   Each iteration uses 910438 nano seconds
>   Case counts: 0 248000000 254000000 243000000 255000000
>   Each iteration uses 885671 nano seconds
>   Case counts: 0 258000000 266000000 231000000 245000000
>   Each iteration uses 912325 nano seconds
>   Case counts: 0 225000000 264000000 270000000 241000000
>   Each iteration uses 904952 nano seconds
>   Case counts: 0 261000000 240000000 241000000 258000000
>
> With your patch (10 runs):
>
>   Each iteration uses 916110 nano seconds
>   Case counts: 0 223000000 266000000 263000000 248000000
>   Each iteration uses 918773 nano seconds
>   Case counts: 0 266000000 230000000 236000000 268000000
>   Each iteration uses 903100 nano seconds
>   Case counts: 0 250000000 249000000 231000000 270000000
>   Each iteration uses 923196 nano seconds
>   Case counts: 0 241000000 243000000 276000000 240000000
>   Each iteration uses 911282 nano seconds
>   Case counts: 0 241000000 239000000 266000000 254000000
>   Each iteration uses 910201 nano seconds
>   Case counts: 0 210000000 263000000 260000000 267000000
>   Each iteration uses 925672 nano seconds
>   Case counts: 0 245000000 265000000 236000000 254000000
>   Each iteration uses 932643 nano seconds
>   Case counts: 0 235000000 259000000 256000000 250000000
>   Each iteration uses 937735 nano seconds
>   Case counts: 0 261000000 242000000 259000000 238000000
>   Each iteration uses 954895 nano seconds
>   Case counts: 0 254000000 239000000 271000000 236000000
>
> Overall, 4evencases.cc is ~2% slower with this patch.
>
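> (For readers without the benchmark sources: 4evencases.cc presumably
> looks roughly like the sketch below, reconstructed from the output format
> above, so names and constants are guesses; 15evencases.cc would be the
> same with 15 cases.)
>
>   // Hypothetical reconstruction of 4evencases.cc; the real source may
>   // differ. Each outer iteration times evenly distributed switch
>   // dispatches; case 0 is never taken, matching the leading 0 in the
>   // "Case counts" lines above.
>   #include <chrono>
>   #include <cstdio>
>   #include <cstdlib>
>
>   int main() {
>     constexpr long kIters = 1000;           // runs of the inner loop (guess)
>     constexpr long kSwitchesPerIter = 1000000; // dispatches per run (guess)
>     long counts[5] = {0, 0, 0, 0, 0};
>     auto t0 = std::chrono::steady_clock::now();
>     for (long i = 0; i < kIters; ++i)
>       for (long j = 0; j < kSwitchesPerIter; ++j)
>         switch (std::rand() % 4 + 1) { // four cases, even probabilities
>         case 1: ++counts[1]; break;
>         case 2: ++counts[2]; break;
>         case 3: ++counts[3]; break;
>         case 4: ++counts[4]; break;
>         }
>     auto t1 = std::chrono::steady_clock::now();
>     double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
>     std::printf("Each iteration uses %g nano seconds\n", ns / kIters);
>     std::printf("Case counts: %ld %ld %ld %ld %ld\n", counts[0], counts[1],
>                 counts[2], counts[3], counts[4]);
>     return 0;
>   }
>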
> Bench: 15evencases.cc
> ---------------------
>
> Without your patch (10 runs):
>
>   Each iteration uses 1.10148e+06 nano seconds
>   Case counts: 0 56000000 60000000 68000000 61000000 69000000 64000000
> 80000000 64000000 68000000 66000000 83000000 74000000 50000000 73000000
> 64000000
>   Each iteration uses 1.0648e+06 nano seconds
>   Case counts: 0 71000000 59000000 55000000 64000000 73000000 57000000
> 55000000 74000000 76000000 67000000 77000000 57000000 82000000 54000000
> 79000000
>   Each iteration uses 1.06872e+06 nano seconds
>   Case counts: 0 55000000 80000000 59000000 45000000 70000000 61000000
> 68000000 72000000 77000000 67000000 88000000 63000000 61000000 77000000
> 57000000
>   Each iteration uses 1.04146e+06 nano seconds
>   Case counts: 0 68000000 61000000 67000000 50000000 70000000 68000000
> 73000000 69000000 61000000 78000000 69000000 64000000 67000000 75000000
> 60000000
>   Each iteration uses 1.0549e+06 nano seconds
>   Case counts: 0 66000000 75000000 64000000 64000000 74000000 78000000
> 63000000 64000000 67000000 57000000 65000000 63000000 74000000 66000000
> 60000000
>   Each iteration uses 1.04246e+06 nano seconds
>   Case counts: 0 66000000 69000000 63000000 76000000 66000000 78000000
> 44000000 66000000 61000000 75000000 66000000 70000000 67000000 64000000
> 69000000
>   Each iteration uses 1.07907e+06 nano seconds
>   Case counts: 0 63000000 66000000 81000000 68000000 56000000 71000000
> 71000000 68000000 58000000 65000000 64000000 75000000 63000000 71000000
> 60000000
>   Each iteration uses 1.05432e+06 nano seconds
>   Case counts: 0 66000000 67000000 70000000 65000000 57000000 53000000
> 62000000 62000000 63000000 74000000 68000000 81000000 70000000 77000000
> 65000000
>   Each iteration uses 1.04041e+06 nano seconds
>   Case counts: 0 71000000 71000000 65000000 69000000 77000000 67000000
> 52000000 60000000 73000000 80000000 76000000 66000000 55000000 49000000
> 69000000
>   Each iteration uses 1.07782e+06 nano seconds
>   Case counts: 0 68000000 76000000 63000000 79000000 76000000 71000000
> 65000000 61000000 63000000 63000000 61000000 56000000 67000000 61000000
> 70000000
>
> With your patch (10 runs):
>
>   Each iteration uses 1.11151e+06 nano seconds
>   Case counts: 0 64000000 64000000 73000000 72000000 69000000 75000000
> 66000000 70000000 77000000 59000000 50000000 74000000 68000000 58000000
> 61000000
>   Each iteration uses 1.28406e+06 nano seconds
>   Case counts: 0 68000000 63000000 66000000 69000000 68000000 58000000
> 71000000 60000000 80000000 66000000 80000000 69000000 57000000 62000000
> 63000000
>   Each iteration uses 1.18149e+06 nano seconds
>   Case counts: 0 67000000 68000000 66000000 69000000 71000000 67000000
> 64000000 69000000 72000000 61000000 73000000 60000000 66000000 71000000
> 56000000
>   Each iteration uses 1.30169e+06 nano seconds
>   Case counts: 0 74000000 66000000 69000000 64000000 70000000 64000000
> 59000000 61000000 53000000 75000000 74000000 58000000 72000000 68000000
> 73000000
>   Each iteration uses 1.15588e+06 nano seconds
>   Case counts: 0 62000000 66000000 67000000 62000000 79000000 65000000
> 59000000 54000000 65000000 61000000 62000000 82000000 74000000 68000000
> 74000000
>   Each iteration uses 1.16992e+06 nano seconds
>   Case counts: 0 69000000 64000000 71000000 60000000 60000000 70000000
> 64000000 77000000 65000000 75000000 61000000 70000000 61000000 77000000
> 56000000
>   Each iteration uses 1.2683e+06 nano seconds
>   Case counts: 0 66000000 69000000 73000000 76000000 72000000 59000000
> 64000000 61000000 53000000 78000000 66000000 63000000 66000000 57000000
> 77000000
>   Each iteration uses 1.17196e+06 nano seconds
>   Case counts: 0 67000000 69000000 84000000 52000000 56000000 70000000
> 58000000 64000000 71000000 72000000 67000000 68000000 68000000 73000000
> 61000000
>   Each iteration uses 1.28627e+06 nano seconds
>   Case counts: 0 70000000 70000000 70000000 57000000 73000000 71000000
> 70000000 57000000 57000000 67000000 69000000 61000000 60000000 76000000
> 72000000
>   Each iteration uses 1.28318e+06 nano seconds
>   Case counts: 0 61000000 72000000 70000000 80000000 68000000 59000000
> 59000000 65000000 49000000 78000000 65000000 64000000 64000000 77000000
> 69000000
>
> Here the performance varies a lot depending on whether or not we are in
> the dense-branch portion. Note also that prediction through the L2 BTB
> has a lower throughput (in branches per cycle).
>
> Excluding outliers, the average performance degradation is ~8-10%.
>
> While this analysis was conducted only on Jaguar, I suspect that similar
> problems would affect AMD Bobcat too, since branch prediction on that
> core is similar to Jaguar's.
>
> I wouldn't be surprised if this patch instead improved the performance of
> code on bigger AMD cores like Bulldozer/Ryzen.
>
> However, at least for now, I suggest making this pass optional (i.e.
> opt-in per subtarget).
> It should definitely be disabled for Jaguar (BtVer2) and Bobcat.
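>
> A minimal sketch of what the opt-in could look like (the pass and
> accessor names below are illustrative, not from the patch; the real
> mechanism would presumably be a SubtargetFeature in X86.td plus a query
> on X86Subtarget):
>
>   // Hypothetical gating of the pass on a subtarget feature bit.
>   bool X86CondBrFoldingPass::runOnMachineFunction(MachineFunction &MF) {
>     const X86Subtarget &ST = MF.getSubtarget<X86Subtarget>();
>     // Dense-branch-sensitive predictors (BtVer2, Bobcat) would leave the
>     // feature unset, so the pass becomes a no-op there.
>     if (!ST.threewayBranchProfitable())
>       return false;
>     // ... folding logic as in the current patch ...
>     return true;
>   }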
>
> -Andrea
>
>
> https://reviews.llvm.org/D46662
>
>
>
>