<div dir="ltr">Hi Andrea,<div><br></div><div>Thanks for running this test and for the explanation. Can you run the tests on Bulldozer/Ryzen? I don't have access to these platforms. If I need to do this in a subtarget-specific way, it would be good to know the performance there.</div><div><br></div><div>Regards,</div><div><br></div><div>-Rong<br><br><div class="gmail_quote"><div dir="ltr">On Wed, Sep 26, 2018 at 6:54 AM Andrea Di Biagio via Phabricator <<a href="mailto:reviews@reviews.llvm.org">reviews@reviews.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">andreadb added a comment.<br>

<br>

Hi Rong,<br>

<br>

On Jaguar, this pass may increase branch density to the point where it hurts the performance of the branch predictor.<br>

<br>

Branch prediction in Jaguar is affected by branch density.<br>

To give you a bit of background, Jaguar's BTB is logically partitioned into two levels: a first level (L1), which is specialized for sparse branches, and a second level (L2), which is specialized for dense branches and is dynamically allocated (when there are more than 2 branches per cache line).<br>

L2 is a bit slower (dynamically allocated), and tends to have a lower throughput than L1. So, ideally, L1 should be used as much as possible.<br>
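
<br>

As a rough illustration of the density rule described above, the sketch below buckets branch addresses by cache line and flags the lines that hold more than 2 branches. The 64-byte line size, the function name and the sample addresses are assumptions made for illustration; they are not taken from this patch.<br>

```python
from collections import Counter

CACHE_LINE_BYTES = 64  # assumption: 64-byte instruction cache lines
DENSE_THRESHOLD = 2    # from the text: more than 2 branches per line counts as dense


def dense_lines(branch_addresses, line_bytes=CACHE_LINE_BYTES):
    """Return {line_index: branch_count} for lines with more than DENSE_THRESHOLD branches."""
    per_line = Counter(addr // line_bytes for addr in branch_addresses)
    return {line: n for line, n in per_line.items() if n > DENSE_THRESHOLD}


# Hypothetical branch addresses: three branches share line 0, one sits in line 2.
example = [0x04, 0x10, 0x3C, 0x90]
print(dense_lines(example))
```

Any line reported by this helper would, under the description above, be a candidate for the dynamically allocated second level rather than the first.<br>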

<br>

This patch increases branch density to the point where the L2 BTB usage increases, and the efficiency of the branch predictor decreases.<br>

<br>

Bench: 4evencases.cc<br>

--------------------<br>

<br>

Without your patch (10 runs):<br>

<br>

  Each iteration uses 902058 nano seconds<br>

  Case counts: 0 261000000 250000000 246000000 243000000<br>

  Each iteration uses 887837 nano seconds<br>

  Case counts: 0 281000000 253000000 227000000 239000000<br>

  Each iteration uses 887856 nano seconds<br>

  Case counts: 0 256000000 254000000 236000000 254000000<br>

  Each iteration uses 880632 nano seconds<br>

  Case counts: 0 279000000 236000000 244000000 241000000<br>

  Each iteration uses 1.03057e+06 nano seconds<br>

  Case counts: 0 258000000 257000000 243000000 242000000<br>

  Each iteration uses 883759 nano seconds<br>

  Case counts: 0 248000000 262000000 278000000 212000000<br>

  Each iteration uses 910438 nano seconds<br>

  Case counts: 0 248000000 254000000 243000000 255000000<br>

  Each iteration uses 885671 nano seconds<br>

  Case counts: 0 258000000 266000000 231000000 245000000<br>

  Each iteration uses 912325 nano seconds<br>

  Case counts: 0 225000000 264000000 270000000 241000000<br>

  Each iteration uses 904952 nano seconds<br>

  Case counts: 0 261000000 240000000 241000000 258000000<br>

<br>

With your patch (10 runs):<br>

<br>

  Each iteration uses 916110 nano seconds<br>

  Case counts: 0 223000000 266000000 263000000 248000000<br>

  Each iteration uses 918773 nano seconds<br>

  Case counts: 0 266000000 230000000 236000000 268000000<br>

  Each iteration uses 903100 nano seconds<br>

  Case counts: 0 250000000 249000000 231000000 270000000<br>

  Each iteration uses 923196 nano seconds<br>

  Case counts: 0 241000000 243000000 276000000 240000000<br>

  Each iteration uses 911282 nano seconds<br>

  Case counts: 0 241000000 239000000 266000000 254000000<br>

  Each iteration uses 910201 nano seconds<br>

  Case counts: 0 210000000 263000000 260000000 267000000<br>

  Each iteration uses 925672 nano seconds<br>

  Case counts: 0 245000000 265000000 236000000 254000000<br>

  Each iteration uses 932643 nano seconds<br>

  Case counts: 0 235000000 259000000 256000000 250000000<br>

  Each iteration uses 937735 nano seconds<br>

  Case counts: 0 261000000 242000000 259000000 238000000<br>

  Each iteration uses 954895 nano seconds<br>

  Case counts: 0 254000000 239000000 271000000 236000000<br>

<br>

Overall, 4evencases.cc is ~2% slower with this patch.<br>
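
<br>

For reference, the slowdown can be re-derived from the timings listed above. This is only a sketch: the helper name and the choice of mean and median as summary statistics are assumptions, and the numbers are copied from the two 4evencases.cc runs.<br>

```python
from statistics import mean, median

# Nanoseconds per iteration, copied from the 4evencases.cc runs above.
baseline = [902058, 887837, 887856, 880632, 1.03057e+06,
            883759, 910438, 885671, 912325, 904952]
patched = [916110, 918773, 903100, 923196, 911282,
           910201, 925672, 932643, 937735, 954895]


def slowdown(before, after, stat=mean):
    """Relative slowdown of `after` versus `before` under the given statistic."""
    return stat(after) / stat(before) - 1.0


print(f"mean slowdown:   {slowdown(baseline, patched):.1%}")
print(f"median slowdown: {slowdown(baseline, patched, median):.1%}")
```

The median is less sensitive to the single 1.03057e+06 outlier in the baseline run, so it reports a somewhat larger slowdown than the mean.<br>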

<br>

Bench: 15evencases.cc<br>

---------------------<br>

<br>

Without your patch (10 runs):<br>

<br>

  Each iteration uses 1.10148e+06 nano seconds<br>

  Case counts: 0 56000000 60000000 68000000 61000000 69000000 64000000 80000000 64000000 68000000 66000000 83000000 74000000 50000000 73000000 64000000<br>

  Each iteration uses 1.0648e+06 nano seconds<br>

  Case counts: 0 71000000 59000000 55000000 64000000 73000000 57000000 55000000 74000000 76000000 67000000 77000000 57000000 82000000 54000000 79000000<br>

  Each iteration uses 1.06872e+06 nano seconds<br>

  Case counts: 0 55000000 80000000 59000000 45000000 70000000 61000000 68000000 72000000 77000000 67000000 88000000 63000000 61000000 77000000 57000000<br>

  Each iteration uses 1.04146e+06 nano seconds<br>

  Case counts: 0 68000000 61000000 67000000 50000000 70000000 68000000 73000000 69000000 61000000 78000000 69000000 64000000 67000000 75000000 60000000<br>

  Each iteration uses 1.0549e+06 nano seconds<br>

  Case counts: 0 66000000 75000000 64000000 64000000 74000000 78000000 63000000 64000000 67000000 57000000 65000000 63000000 74000000 66000000 60000000<br>

  Each iteration uses 1.04246e+06 nano seconds<br>

  Case counts: 0 66000000 69000000 63000000 76000000 66000000 78000000 44000000 66000000 61000000 75000000 66000000 70000000 67000000 64000000 69000000<br>

  Each iteration uses 1.07907e+06 nano seconds<br>

  Case counts: 0 63000000 66000000 81000000 68000000 56000000 71000000 71000000 68000000 58000000 65000000 64000000 75000000 63000000 71000000 60000000<br>

  Each iteration uses 1.05432e+06 nano seconds<br>

  Case counts: 0 66000000 67000000 70000000 65000000 57000000 53000000 62000000 62000000 63000000 74000000 68000000 81000000 70000000 77000000 65000000<br>

  Each iteration uses 1.04041e+06 nano seconds<br>

  Case counts: 0 71000000 71000000 65000000 69000000 77000000 67000000 52000000 60000000 73000000 80000000 76000000 66000000 55000000 49000000 69000000<br>

  Each iteration uses 1.07782e+06 nano seconds<br>

  Case counts: 0 68000000 76000000 63000000 79000000 76000000 71000000 65000000 61000000 63000000 63000000 61000000 56000000 67000000 61000000 70000000<br>

<br>

With your patch (10 runs):<br>

<br>

  Each iteration uses 1.11151e+06 nano seconds<br>

  Case counts: 0 64000000 64000000 73000000 72000000 69000000 75000000 66000000 70000000 77000000 59000000 50000000 74000000 68000000 58000000 61000000<br>

  Each iteration uses 1.28406e+06 nano seconds<br>

  Case counts: 0 68000000 63000000 66000000 69000000 68000000 58000000 71000000 60000000 80000000 66000000 80000000 69000000 57000000 62000000 63000000<br>

  Each iteration uses 1.18149e+06 nano seconds<br>

  Case counts: 0 67000000 68000000 66000000 69000000 71000000 67000000 64000000 69000000 72000000 61000000 73000000 60000000 66000000 71000000 56000000<br>

  Each iteration uses 1.30169e+06 nano seconds<br>

  Case counts: 0 74000000 66000000 69000000 64000000 70000000 64000000 59000000 61000000 53000000 75000000 74000000 58000000 72000000 68000000 73000000<br>

  Each iteration uses 1.15588e+06 nano seconds<br>

  Case counts: 0 62000000 66000000 67000000 62000000 79000000 65000000 59000000 54000000 65000000 61000000 62000000 82000000 74000000 68000000 74000000<br>

  Each iteration uses 1.16992e+06 nano seconds<br>

  Case counts: 0 69000000 64000000 71000000 60000000 60000000 70000000 64000000 77000000 65000000 75000000 61000000 70000000 61000000 77000000 56000000<br>

  Each iteration uses 1.2683e+06 nano seconds<br>

  Case counts: 0 66000000 69000000 73000000 76000000 72000000 59000000 64000000 61000000 53000000 78000000 66000000 63000000 66000000 57000000 77000000<br>

  Each iteration uses 1.17196e+06 nano seconds<br>

  Case counts: 0 67000000 69000000 84000000 52000000 56000000 70000000 58000000 64000000 71000000 72000000 67000000 68000000 68000000 73000000 61000000<br>

  Each iteration uses 1.28627e+06 nano seconds<br>

  Case counts: 0 70000000 70000000 70000000 57000000 73000000 71000000 70000000 57000000 57000000 67000000 69000000 61000000 60000000 76000000 72000000<br>

  Each iteration uses 1.28318e+06 nano seconds<br>

  Case counts: 0 61000000 72000000 70000000 80000000 68000000 59000000 59000000 65000000 49000000 78000000 65000000 64000000 64000000 77000000 69000000<br>

<br>

Here the performance varies a lot depending on whether or not we are in the dense-branch portion. Note also that prediction through the L2 BTB has a lower throughput (in branches per cycle).<br>

<br>

Excluding outliers, the average performance degradation is ~8-10%.<br>

<br>

While this analysis has only been conducted on Jaguar, I suspect that similar problems would affect AMD Bobcat too, since branch prediction on that core is similar to Jaguar's.<br>

<br>

I wouldn't be surprised if this patch instead improved the performance of code on other big AMD cores like Bulldozer/Ryzen.<br>

<br>

However, at least for now, I suggest making this pass optional (i.e. opt-in for subtargets).<br>

It should definitely be disabled for Jaguar (BtVer2) and Bobcat.<br>

<br>

-Andrea<br>

<br>

<br>

<a href="https://reviews.llvm.org/D46662" rel="noreferrer" target="_blank">https://reviews.llvm.org/D46662</a><br>

<br>

<br>

<br>

</blockquote></div></div></div>