<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/56383>56383</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            Redundant jmp instructions make N-body C++ benchmark ~30% slower than on GCC (Benchmarks Game)

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            backend:X86,

            llvm:codegen,

            performance

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          yurai007

      </td>

    </tr>

</table>

<pre>

**Problem description**

Consider the following C++ snippet: https://godbolt.org/z/oaY61Yjhz. The most relevant part, the std::sqrt computation in the advance function, looks like:

        ...

        double dSquared = dx * dx + dy * dy + dz * dz;

        double mag = dt / (dSquared * std::sqrt(dSquared));

        ...         

In the unrolled and vectorized output produced by GCC, every sqrt computation is preceded by a comparison of dSquared against 0 and a conditional jump to the slow path:

        vxorpd  xmm8, xmm8, xmm8

        ...

        vucomisd  xmm8, xmm4    ; Compare dSquared with 0

        ja      .L42                ; Jump if dSquared < 0

        vsqrtsd xmm0, xmm4, xmm4        ; Compute square root 

That is the good case. In the bad case, the code generated by Clang, there is one extra, redundant unconditional jmp instruction after the dSquared check, at the end of every block containing vsqrtsd:

        vxorpd  xmm5, xmm5, xmm5

        vucomisd  xmm1, xmm5    

        ...

        jb      .LBB0_2

        vsqrtsd xmm0, xmm1, xmm1

        jmp     .LBB0_3             ; Redundant unconditional jmp

**Impact on Benchmarks Game N-body benchmark**

The advance function shown above is the core computation of one of the C++ Benchmarks Game programs, the N-body benchmark: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gpp-9.html. The relevant ja+vsqrtsd (GCC) and jb+vsqrtsd+jmp (Clang) sequences can be seen in the benchmark assembly output: https://godbolt.org/z/zvE1W3jvK. Since the hot path containing those instructions is executed tens of millions of times, with Clang that one extra jmp causes a significant ~30% slowdown compared to the binary produced by GCC.

The redundant branch in the Clang binary shows up in perf output as well. If you isolate just the advance loop from the N-body benchmark (which dominates the whole execution time anyway), build the code (using the exact command from: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gpp-9.html for both Clang and GCC) and run it, you should get perf output similar to this:

    [yurai@archlinux release]$ perf stat -e task-clock,cache-references,cache-misses,cycles,instructions,branches,branch-misses,L1-icache-misses ./nbody-cc-clang 50000000

    time (ms): 4061

    Performance counter stats for './nbody-cc-clang 50000000':

           4062.76 msec task-clock:u              #    1.000 CPUs utilized          

             81398      cache-references:u        #   20.035 K/sec                    (71.35%)

             45289      cache-misses:u            #   55.639 % of all cache refs      (57.09%)

       12402630163      cycles:u                  #    3.053 GHz                      (57.17%)

       21957136382      instructions:u            #    1.77  insn per cycle           (71.49%)

        1150438778      branches:u                #  283.167 M/sec                    (71.50%)

              6943      branch-misses:u           #    0.00% of all branches          (71.48%)

             50259      L1-icache-misses:u                                            (71.41%)

       4.063508370 seconds time elapsed

    [yurai@archlinux release]$ perf stat -e task-clock,cache-references,cache-misses,cycles,instructions,branches,branch-misses,L1-icache-misses ./nbody-cc-gcc 50000000

    time (ms): 3277

    Performance counter stats for './nbody-cc-gcc 50000000':

           3278.81 msec task-clock:u              #    1.000 CPUs utilized          

             87807      cache-references:u        #   26.780 K/sec                    (71.43%)

             49205      cache-misses:u            #   56.038 % of all cache refs      (57.18%)

       10020295771      cycles:u                  #    3.056 GHz                      (57.18%)

       19647384440      instructions:u            #    1.96  insn per cycle           (71.45%)

         650524259      branches:u                #  198.402 M/sec                    (71.45%)

               371      branch-misses:u           #    0.00% of all branches          (71.38%)

             29921      L1-icache-misses:u                                            (71.36%)

       3.279453499 seconds time elapsed

The Clang binary (nbody-cc-clang) takes about 24% longer to run (4061 ms vs. 3277 ms) and executes about 1.8x as many branches as the GCC binary (nbody-cc-gcc). Interestingly, the branch-miss and L1-icache-miss counts are quite low for both runs.

With Clang there are more instruction cache misses (50259 vs. 29921), which can be explained by looking again at the advance assembly: https://godbolt.org/z/oaY61Yjhz, where cold blocks are interleaved with hot ones.

Even if that could be improved, I believe it is a minor CodeGen issue orthogonal to the one reported here; after all, what really matters is the number of branches.

**Workarounds**

Finally, there are at least two workarounds. The first is to pass -fno-math-errno, after which Clang emits branchless assembly for the advance function and the run time matches GCC. The second is to replace the "constexpr" specifier on advance with "static". After that change the advance function body is inlined and the whole loop gets vectorized by LV (the loop vectorizer), which somehow mitigates the penalty of the redundant jmp instructions (they are still present in the assembly). As a result, execution time matches GCC.

Although the problem is easy to work around for the N-body benchmark, I still believe a general improvement in LLVM CodeGen would be worthwhile.

</pre>
