<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/58327">58327</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            AArch32 FP16 neon average function produces incorrect result when optimized
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          fbarchard
      </td>
    </tr>
</table>

<pre>
An average function written with NEON intrinsics produces inaccurate values when optimized.
It works at -O0 but fails with optimization enabled: -Os, -Oz, -O1, and -O2 all fail.
It also works when built for AArch64 with optimization on; it fails only for AArch32.

The inner loop accumulates 4 sums, one per row:

```
const __fp16* i0 = input;
const __fp16* i1 = (const __fp16*) ((uintptr_t) i0 + elements);
const __fp16* i2 = (const __fp16*) ((uintptr_t) i1 + elements);
const __fp16* i3 = (const __fp16*) ((uintptr_t) i2 + elements);
size_t n = elements;
while (n >= 8 * sizeof(__fp16)) {
  const float16x8_t vi0 = vld1q_f16(i0); i0 += 8;
  const float16x8_t vi1 = vld1q_f16(i1); i1 += 8;
  const float16x8_t vi2 = vld1q_f16(i2); i2 += 8;
  const float16x8_t vi3 = vld1q_f16(i3); i3 += 8;
  vsum0 = vaddq_f16(vsum0, vi0);
  vsum1 = vaddq_f16(vsum1, vi1);
  vsum2 = vaddq_f16(vsum2, vi2);
  vsum3 = vaddq_f16(vsum3, vi3);
  n -= 8 * sizeof(__fp16);
}
```
The results are later combined and output as 4 fp16 values:

```
float16x4_t vout = vmul_f16(vsum, vmultiplier);
vout = vmax_f16(vout, voutput_min);
vout = vmin_f16(vout, voutput_max);
vst1_f16(o, vout); o += 4;
```
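
For comparison, a scalar reference may help isolate whether the vectorized path is at fault. The sketch below is not part of XNNPACK: `gavgpool_cw_ref` and its parameters are hypothetical names, it uses `float` instead of `__fp16` so it builds on any host, and it assumes `elements` counts values (not bytes) and that the multiplier is `1 / elements`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference (not XNNPACK code): averages each of
 * `channels` rows of `elements` contiguous floats, then clamps the result
 * to [out_min, out_max]. */
void gavgpool_cw_ref(const float *input, size_t elements, size_t channels,
                     float out_min, float out_max, float *output) {
  assert(elements > 0);
  /* Assumption: the multiplier is simply 1 / elements. */
  const float multiplier = 1.0f / (float) elements;
  for (size_t c = 0; c < channels; c++) {
    const float *row = input + c * elements;
    float sum = 0.0f;
    for (size_t i = 0; i < elements; i++) {
      sum += row[i];
    }
    float out = sum * multiplier;
    if (out < out_min) out = out_min; /* same effect as vmax_f16(vout, voutput_min) */
    if (out > out_max) out = out_max; /* same effect as vmin_f16(vout, voutput_max) */
    output[c] = out;
  }
}
```

Running the same inputs through a reference like this and through the NEON kernel at -O0 and -O2 would show which of the 4 outputs diverges.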

If the code is simplified to do 1 average at a time instead of 4, it works.
If the code computes 4 averages but only outputs 1, it fails.

When tested with 4 rows of 8 elements, using random inputs from 0.1 to 10.0, the average falls outside the test's 1% tolerance:
```
[ RUN      ] F16_GAVGPOOL_CW__NEONFP16ARITH_X8.elements_eq_8
third_party/XNNPACK/test/gavgpool-cw-microkernel-tester.h:190: Failure
The difference between fp16_ieee_to_fp32_value(y[i]) and y_ref[i] is 0.4029541015625, which exceeds 1.0e-2f * std::abs(y_ref[i]), where
fp16_ieee_to_fp32_value(y[i]) evaluates to 5.375,
y_ref[i] evaluates to 4.9720458984375, and
1.0e-2f * std::abs(y_ref[i]) evaluates to 0.049720458686351776.
at position 1, elements = 8, channels = 4
[  FAILED  ] F16_GAVGPOOL_CW__NEONFP16ARITH_X8.elements_eq_8 (2 ms)
```
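
The comparison in that message can be reproduced by hand. The sketch below is an assumption-laden illustration, not the tester's actual code: `within_tolerance` is a hypothetical helper, and the 1% relative tolerance is taken from the `1.0e-2f * std::abs(y_ref[i])` expression quoted in the log.

```c
#include <stdbool.h>

/* Hypothetical helper mirroring the check quoted in the log:
 * passes when |y - y_ref| does not exceed rel_tol * |y_ref|. */
bool within_tolerance(float y, float y_ref, float rel_tol) {
  float diff = y - y_ref;
  if (diff < 0.0f) diff = -diff;
  float bound = y_ref < 0.0f ? -y_ref : y_ref;
  return diff <= rel_tol * bound;
}
```

With the values from the log, `within_tolerance(5.375f, 4.9720458984375f, 1.0e-2f)` is false: the difference of 0.4029541015625 is roughly 8% of the reference, well beyond the 1% bound of about 0.0497.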

[neonfp16arith-x8.txt](https://github.com/llvm/llvm-project/files/9768408/neonfp16arith-x8.txt)

</pre>