<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/56383>56383</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Redundant jmp instructions make N-body C++ benchmark ~30% slower than on GCC (Benchmarks Game)
</td>
</tr>
<tr>
<th>Labels</th>
<td>
backend:X86,
llvm:codegen,
performance
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
yurai007
</td>
</tr>
</table>
<pre>
**Problem description**
Consider the following C++ snippet: https://godbolt.org/z/oaY61Yjhz. The most relevant part, the std::sqrt computation in the advance function, looks like this:
...
double dSquared = dx * dx + dy * dy + dz * dz;
double mag = dt / (dSquared * std::sqrt(dSquared));
...
In the unrolled and vectorized output produced by GCC, every sqrt computation is preceded by a check of dSquared against 0 and a conditional jump to the slow path:
vxorpd xmm8, xmm8, xmm8
...
vucomisd xmm8, xmm4 ; Compare dSquared with 0
ja .L42 ; Jump if dSquared < 0
vsqrtsd xmm0, xmm4, xmm4 ; Compute square root
That is the OK scenario. In the NOK scenario, the code generated by Clang, there is one extra, redundant unconditional jmp instruction after the dSquared check, at the end of every block containing vsqrtsd:
vxorpd xmm5, xmm5, xmm5
vucomisd xmm1, xmm5
...
jb .LBB0_2
vsqrtsd xmm0, xmm1, xmm1
jmp .LBB0_3 ; Redundant unconditional jmp
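To compare the two compilers without the rest of the benchmark, the computation can be isolated into a small function. The following is only a minimal sketch: `pair_magnitude` and `check_pair_magnitude` are hypothetical names that do not appear in the benchmark source, and whether a given compiler version emits the guarded vsqrtsd sequence for it still has to be checked in the generated assembly.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of the benchmark): the per-pair magnitude
// computed inside advance(), pulled out so it can be compiled on its own.
double pair_magnitude(double dx, double dy, double dz, double dt) {
    double dSquared = dx * dx + dy * dy + dz * dz;
    // When errno-setting math is enabled, a negative argument must reach the
    // library sqrt so that errno can be set, which is why the compiler guards
    // the inline square root with a compare and a conditional jump.
    return dt / (dSquared * std::sqrt(dSquared));
}

// Usage check: an offset of (3, 4, 0) gives dSquared = 25, so mag = 1 / (25 * 5).
void check_pair_magnitude() {
    assert(std::fabs(pair_magnitude(3.0, 4.0, 0.0, 1.0) - 0.008) < 1e-12);
}
```

Compiling this function with the benchmark's optimization flags on both compilers should make the difference in branch layout easier to inspect than in the fully unrolled loop.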
**Impact on Benchmarks Game N-body benchmark**
The advance function shown above is the core computation of one of the C++ Benchmarks Game programs, the N-body benchmark: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gpp-9.html. The relevant ja+vsqrtsd (GCC) and jb+vsqrtsd+jmp (Clang) sequences can be seen in the benchmark assembly output: https://godbolt.org/z/zvE1W3jvK
Since the hot path containing those instructions is executed tens of millions of times, with Clang that one extra jmp causes a significant ~30% slowdown compared with the binary produced by GCC.
The redundant branch in the Clang binary also shows up in perf output. If you isolate just the advance loop from the N-body benchmark (which dominates the total execution time anyway), build it with both Clang and GCC using the exact command from https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gpp-9.html, and run it, you should get perf output similar to this:
[yurai@archlinux release]$ perf stat -e task-clock,cache-references,cache-misses,cycles,instructions,branches,branch-misses,L1-icache-misses ./nbody-cc-clang 50000000
time (ms): 4061
Performance counter stats for './nbody-cc-clang 50000000':
4062.76 msec task-clock:u # 1.000 CPUs utilized
81398 cache-references:u # 20.035 K/sec (71.35%)
45289 cache-misses:u # 55.639 % of all cache refs (57.09%)
12402630163 cycles:u # 3.053 GHz (57.17%)
21957136382 instructions:u # 1.77 insn per cycle (71.49%)
1150438778 branches:u # 283.167 M/sec (71.50%)
6943 branch-misses:u # 0.00% of all branches (71.48%)
50259 L1-icache-misses:u (71.41%)
4.063508370 seconds time elapsed
[yurai@archlinux release]$ perf stat -e task-clock,cache-references,cache-misses,cycles,instructions,branches,branch-misses,L1-icache-misses ./nbody-cc-gcc 50000000
time (ms): 3277
Performance counter stats for './nbody-cc-gcc 50000000':
3278.81 msec task-clock:u # 1.000 CPUs utilized
87807 cache-references:u # 26.780 K/sec (71.43%)
49205 cache-misses:u # 56.038 % of all cache refs (57.18%)
10020295771 cycles:u # 3.056 GHz (57.18%)
19647384440 instructions:u # 1.96 insn per cycle (71.45%)
650524259 branches:u # 198.402 M/sec (71.45%)
371 branch-misses:u # 0.00% of all branches (71.38%)
29921 L1-icache-misses:u (71.36%)
3.279453499 seconds time elapsed
In this isolated run, the Clang binary (nbody-cc-clang) takes about 24% longer (4061 ms vs. 3277 ms) and executes nearly 2x as many branches as the GCC binary (nbody-cc-gcc). Interestingly, the branch-miss and L1-icache-miss counts are low for both runs.
Clang shows more instruction-cache misses, which can be explained by looking again at the advance assembly: https://godbolt.org/z/oaY61Yjhz where cold blocks are interleaved with hot ones.
Even if that could be improved, I believe it is a minor CodeGen issue orthogonal to the one reported; what really matters here is the number of branches.
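The ratios can be derived directly from the perf counters above. The sketch below copies the counter values from the two runs and computes the time ratio, the branch ratio, and the number of extra branches Clang executes per simulation step. All identifiers (`PerfRun`, `kClang`, `kGcc`, `kSteps`, and the helper functions) are made up for this illustration. The per-step figure comes out at about 10, which is what one extra jmp per sqrt would give if each step evaluates ten body pairs (an assumption based on the benchmark simulating five bodies).

```cpp
#include <cassert>
#include <cstdio>

// Counter values copied from the two perf stat runs above.
struct PerfRun {
    double task_clock_ms;
    long long cycles;
    long long instructions;
    long long branches;
};

constexpr PerfRun kClang{4062.76, 12402630163LL, 21957136382LL, 1150438778LL};
constexpr PerfRun kGcc{3278.81, 10020295771LL, 19647384440LL, 650524259LL};

// Number of simulation steps passed on the command line in both runs.
constexpr long long kSteps = 50000000LL;

// Wall-clock ratio, Clang over GCC.
constexpr double time_ratio() {
    return kClang.task_clock_ms / kGcc.task_clock_ms;
}

// Executed-branch ratio, Clang over GCC.
constexpr double branch_ratio() {
    return static_cast<double>(kClang.branches) / static_cast<double>(kGcc.branches);
}

// Extra branches the Clang binary executes per simulation step.
constexpr double extra_branches_per_step() {
    return static_cast<double>(kClang.branches - kGcc.branches) /
           static_cast<double>(kSteps);
}

void print_ratios() {
    assert(kGcc.branches > 0 && kSteps > 0);  // guard the divisions above
    std::printf("time ratio (clang/gcc):   %.3f\n", time_ratio());
    std::printf("branch ratio (clang/gcc): %.3f\n", branch_ratio());
    std::printf("extra branches per step:  %.2f\n", extra_branches_per_step());
}
```

Calling `print_ratios()` from `main` prints the three figures. They are approximate, since perf multiplexed the counters in these runs (the percentages in parentheses).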
**Workarounds**
Finally, there are at least two workarounds. First, with -fno-math-errno, Clang emits branchless assembly for the advance function and the run time is as fast as with GCC. Second, replacing the "constexpr" specifier on advance with "static" also improves the Clang output: the advance function body is then inlined and the whole loop is vectorized by the loop vectorizer (LV), which somehow mitigates the penalty of the redundant jmp instructions (they are still present in the assembly). As a result, execution time matches GCC.
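The first workaround works because the branch exists only to preserve errno semantics. The sketch below illustrates the behavior that the slow path protects; `sqrt_sets_edom` and `check_sqrt_errno` are hypothetical helpers written for this illustration. On implementations where math_errhandling includes MATH_ERRNO, std::sqrt of a negative value sets errno to EDOM, whereas with -fno-math-errno the compiler is allowed to assume errno is never touched and can drop the check.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>

// Hypothetical helper: reports whether std::sqrt(x) set errno to EDOM.
bool sqrt_sets_edom(double x) {
    volatile double input = x;  // volatile keeps the call from being folded away
    errno = 0;
    volatile double result = std::sqrt(input);
    (void)result;
    return errno == EDOM;
}

// Usage check: a non-negative argument never reports a domain error; a
// negative one does when the implementation reports math errors via errno.
void check_sqrt_errno() {
    assert(!sqrt_sets_edom(4.0));
    if (math_errhandling & MATH_ERRNO) {
        assert(sqrt_sets_edom(-1.0));
    }
}
```

Since dSquared is a sum of squares, the slow path is never taken in the benchmark, so giving up the errno update does not change its results.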
Although the problem is quite easy to work around for the N-body benchmark, I still believe a general improvement in LLVM CodeGen would be worthwhile.
</pre>