<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/61218>61218</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            ARM MVE - VFMAS instruction never generated if the scalar is constant
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          kjbracey
      </td>
    </tr>
</table>

<pre>
    The VFMAS instruction is used far less often than VFMA, and where it does apply the scalar is often a constant, e.g. in a Newton-Raphson inverse square root refinement step:

    x = x * (1.5 - 0.5 * x * x)            // Can use VMUL ; VFMAS ; VMUL
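
As a rough illustration of how that step splits into multiply, multiply-add-scalar, multiply, here is a scalar C sketch. It assumes the value whose inverse square root is wanted (`a`, not shown in the formula above) is folded into a precomputed `-0.5 * a` term. The names `rsqrt_step` and `rsqrt_refine` are hypothetical, and this is plain C rather than MVE intrinsics:

```c
/* Hypothetical helper: one refinement step for x ~= 1/sqrt(a).
   neg_half_a is assumed to be -0.5f * a, computed once outside the loop. */
static float rsqrt_step(float x, float neg_half_a)
{
    float t = x * x;               /* VMUL:  t = x * x                    */
    t = t * neg_half_a + 1.5f;     /* VFMAS shape: vector * vector + 1.5  */
    return x * t;                  /* VMUL:  x = x * t                    */
}

/* Hypothetical driver: apply `steps` refinements to the initial guess x0. */
static float rsqrt_refine(float a, float x0, int steps)
{
    const float neg_half_a = -0.5f * a;
    float x = x0;
    for (int i = 0; i < steps; ++i)
        x = rsqrt_step(x, neg_half_a);
    return x;
}
```

For example, `rsqrt_refine(4.0f, 0.4f, 4)` converges towards 0.5. The middle line is the only place the constant 1.5 appears, and it appears as the addend, which is exactly the operand VFMAS takes as a scalar.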

MVE provides two VFMA forms (`V * V + V` and `V * S + V`) and one VFMAS form (`V * V + S`). A key difference is which input register is overwritten: VFMA writes its result to the addend, while VFMAS writes back to one of the multiplicands.

Clang can generate VFMAS either from the `vfmasq` intrinsic or from a `float32x4_t * float32x4_t + float32_t` expression, but **only** if the scalar is not a known constant. If the scalar is constant, Clang always loads it into a vector register and uses the all-vector `VFMA`, even though this inevitably means accompanying every `VFMA` with a `VMOV`, because the constant addend gets overwritten.
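
To make the cost of the destructive addend concrete, here is a small plain-C model of the two forms involved. It is only a sketch: `vec4`, `model_vfma`, `model_vfmas`, `chain_with_vfma` and `chain_with_vfmas` are hypothetical names, the vector-by-scalar VFMA form is left out, and each function's return value stands for the new contents of the destination register:

```c
enum { LANES = 4 };                          /* float32x4_t has four lanes */

typedef struct { float lane[LANES]; } vec4;  /* hypothetical stand-in for a Q register */

/* All-vector VFMA: the addend register is the destination (qda = qda + qn * qm). */
static vec4 model_vfma(vec4 qda, vec4 qn, vec4 qm)
{
    for (int i = 0; i < LANES; ++i)
        qda.lane[i] = qda.lane[i] + qn.lane[i] * qm.lane[i];
    return qda;
}

/* VFMAS: a multiplicand register is the destination (qda = qda * qn + rm). */
static vec4 model_vfmas(vec4 qda, vec4 qn, float rm)
{
    for (int i = 0; i < LANES; ++i)
        qda.lane[i] = qda.lane[i] * qn.lane[i] + rm;
    return qda;
}

/* a = a * b + k, repeated n times, using VFMA with the constant in a vector. */
static vec4 chain_with_vfma(vec4 a, vec4 b, vec4 k, int n)
{
    for (int i = 0; i < n; ++i) {
        vec4 acc = k;                /* the extra VMOV: k would otherwise be lost */
        acc = model_vfma(acc, a, b); /* acc = k + a * b                           */
        a = acc;
    }
    return a;
}

/* The same chain using VFMAS: the scalar survives, so no copy is needed. */
static vec4 chain_with_vfmas(vec4 a, vec4 b, float c, int n)
{
    for (int i = 0; i < n; ++i)
        a = model_vfmas(a, b, c);    /* a = a * b + c */
    return a;
}
```

Both chains compute the same values, but the VFMA version needs a fresh copy of the constant on every iteration, while the VFMAS version reuses the scalar untouched.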

I've had no success in generating a VFMAS instruction for a constant scalar, so my Newton-Raphson iterations are VMUL; VMOV; VFMA; VMUL.

Non-constant scalar:

```
float32x4_t func3(float32x4_t a, float32x4_t b, float32_t c)
{
    a = vfmasq(a,b,c);
    a = vfmasq(a,b,c);
    return vfmasq(a,b,c);
}

func3:
        vmov    r0, s8
        vfmas.f32       q0, q1, r0
        vfmas.f32       q0, q1, r0
        vfmas.f32       q0, q1, r0
        bx      lr
```

Constant scalar:

```
float32x4_t func1(float32x4_t a, float32x4_t b)
{
    a = vfmasq(a,b,1.5f);
    a = vfmasq(a,b,1.5f);
    return vfmasq(a,b,1.5f);
}

func1:
        vmov.f32        q2, #1.500000e+00
        vmov    q3, q2
        vfma.f32        q3, q0, q1
        vmov    q0, q2
        vfma.f32        q0, q3, q1
        vfma.f32        q2, q0, q1
        vmov    q0, q2
        bx      lr
```

More examples at https://godbolt.org/z/cc5navr54

</pre>