[llvm-dev] [RFC] Should -ffast-math affect intrinsics?

Mon Jul 12 14:28:25 PDT 2021

I've got the following little program that illustrates what I think is a problem. This is for X86/Intel64 intrinsics.

If compiled using
$ clang -O2 intrin_prob.c
$ a.out
2.000000, 3.000000

This is the expected result.  But if compiled using
$ clang -O2 -ffast-math intrin_prob.c
$ a.out
1.500000, 3.255000

This gets incorrect results, because reassociation happens across the calls to the _mm_add_pd and _mm_sub_pd intrinsics,
and the value that should have been added and then subtracted cancels out (it gets constant folded to zero).  It seems to me
that the fast-math flags really should not affect the intrinsic implementations themselves, and that the fast-math flags
should not allow reassociation across the intrinsic calls. So, is this expected behavior, or just something that no one has
noticed before?  It surprised me. I have also checked GCC's behavior, which is consistent with clang's.  The Intel C/C++
compiler does not let fast-math flags affect intrinsics, at least not for reassociation across the call boundaries.  I haven't
checked the Microsoft compiler yet.
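
For anyone not familiar with the trick the test program relies on, here is a scalar sketch of it.  This is only an
illustration: round_via_magic and MAGIC are hypothetical names, not part of the program at the end of this message, and the
volatile is my own addition so that the sketch itself survives -ffast-math.  It assumes the default round-to-nearest-even
mode and plain double arithmetic.

```c
/* 0x4338000000000000 viewed as a double is 2^52 + 2^51.  At this magnitude
   adjacent doubles are exactly 1.0 apart, so a sum has no fraction bits. */
static const double MAGIC = 6755399441055744.0;

/* Hypothetical helper (not in the original program).  The volatile store
   forces the rounded sum to be materialized, so the compiler cannot cancel
   +MAGIC against -MAGIC even when reassociation is allowed. */
static double round_via_magic(double x) {
  volatile double sum = x + MAGIC; /* fraction is rounded away here */
  return sum - MAGIC;              /* exact: leaves the rounded integer */
}
```

With argc == 1 the program's inputs are 1.5 and 3.255, so round_via_magic(1.5) gives 2.0 (the tie goes to the even
neighbor) and round_via_magic(3.255) gives 3.0, matching the expected output.  If the add and subtract are reassociated
into x + (MAGIC - MAGIC), the inputs come back unchanged, which is the -ffast-math output.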

An easy "fix" would be to add 
#pragma float_control(precise, on)
or
#pragma clang fp reassociate(off)
near the top of immintrin.h to cause all intrinsics to ignore all fast-math flags, or at least ignore reassociation.
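
To show how the second pragma is scoped, here is a small sketch using it inside an ordinary function rather than in the
header.  The function name add_then_sub is hypothetical, and the __clang__ guard is only there so other compilers do not
see an unknown pragma.  I have not verified that this is sufficient to keep the sequence intact under -ffast-math on
every clang version, so treat that as an assumption.

```c
/* Hypothetical user-side helper illustrating the pragma's scope; the
   proposal above would put the pragma in immintrin.h instead. */
static double add_then_sub(double x, double c) {
#if defined(__clang__)
  /* Must appear at the start of a compound statement (or at file scope). */
  #pragma clang fp reassociate(off)
#endif
  double sum = x + c; /* intended to stay a real add ... */
  return sum - c;     /* ... followed by a real subtract */
}
```

Note that a pragma in user code like this only covers the operations written in that scope.  The intrinsics are defined in
the header, which is why the suggestion is to place the pragma there.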

$ cat intrin_prob.c
#include <immintrin.h>
#include <stdio.h>

static union {
  double u1[2];
  __m128d u2;
} t1[1] = {1.25, 3.25};

int main(int argc, char **argv) {
  __m128d t2;
  __m128d t3;
  // This is just so the compiler cannot constant fold
  // and know the values of t1.
  t1[0].u1[0] += argc * 0.25;
  t1[0].u1[1] += argc * .005;

  // This value, when added and then subtracted, should cause
  // the values to be rounded to the nearest integer. If the
  // compiler optimizes the add and subtract out by doing
  // reassociation, then the printed values will have
  // fractional parts.  If the compiler does the intrinsics
  // as expected, then the values printed will have no fractional part.
  t2 = _mm_castsi128_pd(_mm_set_epi32((int)((0x4338000000000000uLL) >> 32),
                                      (int)((0x4338000000000000uLL) >> 0),
                                      (int)((0x4338000000000000uLL) >> 32),
                                      (int)((0x4338000000000000uLL) >> 0)));
  t3 = _mm_add_pd(t1[0].u2, t2);
  t3 = _mm_sub_pd(t3, t2);
  t1[0].u2 = t3;

  printf("%f, %f\n", t1[0].u1[0], t1[0].u1[1]);
  return 0;
}