<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/62735>62735</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [aarch64] gcc generates better code than clang
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          vfdff
      </td>
    </tr>
</table>

<pre>
* Test case, see https://gcc.godbolt.org/z/T1qzcWjTK
```
vec mla1(vec v0, vec v1, int v2)
{
   return v0 - v1 * v2;
}
```
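The snippet above omits the definition of `vec`. A self-contained sketch follows. It assumes `vec` is a vector of four 32-bit ints declared with the GCC/Clang `vector_size` extension, which matches the `.4s` arrangement in the assembly; the actual typedef behind the godbolt link may differ.
```c
/* Assumption: vec is 4 x int32 (16 bytes), matching the .4s registers in the
   generated code. The original test case may define it differently. */
typedef int vec __attribute__((vector_size(16)));

vec mla1(vec v0, vec v1, int v2)
{
    /* With the GCC/Clang vector extension, the scalar v2 is broadcast to
       every lane before the multiply, so each lane computes v0[i] - v1[i] * v2. */
    return v0 - v1 * v2;
}
```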
* gcc: fmov + mls
```
        fmov    s31, w0
        mls     v0.4s, v1.4s, v31.s[0]
        ret
```
* llvm: dup + mls
```
        dup     v2.4s, w0
        mls     v0.4s, v2.4s, v1.4s
        ret
```
On some targets (I'm not sure about all of them), the latency of a dup from a scalar register to a vector register is much more than 1 cycle, while the latency of an fmov from a scalar register to a float register is usually 1 cycle, so gcc's assembly is better.
</pre>