<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/55202>55202</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Missed Optimization - Replacement of rint/lrint with X87/SSE specific instructions
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Hendiadyoin1
      </td>
    </tr>
</table>

<pre>
x87 and SSE provide simple rounding and converting-store instructions, which are essentially equivalent to `l{0,2}rint[fl]?`.

Clang/LLVM does not seem to replace calls to the `rint` family with these instructions, nor does it always vectorise such calls when they are used to round/convert vectors.
(Truncation is replaced properly.)

Some examples follow below.

GCC is listed as well.
The main differences are that GCC schedules its `fldcw` for truncation earlier, replaces `rintl`, and uses some bit magic for `rintf`.

Note: Using `f32x4` for `float __vector(4)` and `i32x4` for `int __vector(4)`
Note: `cvtss2si` != `cvttss2si`
Note: Assuming overflows etc. are UB, and that the hardware's behaviour in those cases is acceptable

Scenario                   |LLVM                            |GCC                             |Effective instruction(s)
---------------------------|------------------------------- |------------------------------- |-------------------
`rintl`                    | `call    rintl@PLT`            | `frndint`                      | `frndint`
`(int)rintl`               | `call    rintl@PLT` +truncation| `call    rintl@PLT` +truncation| `fistp m16/m32/m64`
`lrintl`                   | `call    lrintl`               | `call    lrintl`               | `fistp m16/m32/m64`
|||
`lrint`                    | `call    lrintl`               | `call    lrintl`               | `cvtss2si r32/r64, xmmX`
`(int)rintf`               | `call    rintf@PLT;cvttss2si`  | Bit magic+`cvttss2si`          | `cvtss2si r32, xmmX`
`(int)rintf (SSE4.2)`      | `roundss + cvttss2si`          | `roundss + cvttss2si`          | `cvtss2si r32, xmmX`
`4x lrintf (f32x4->i32x4)` | 4x (shuffle+`call    lrintl`)  | 4x (shuffle+`call    lrintl`)  | `cvtps2dq xmmY, xmmX`
 
Tested using Godbolt with `x86_64 Clang 14.0.0` as well as `x86_64 GCC 11.2`, at `-O2` and `-O3`.

</pre>