[PATCH] D86855: Convert __m64 intrinsics to unconditionally use SSE2 instead of MMX instructions.

James Y Knight via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Jan 6 20:30:36 PST 2021


jyknight marked 7 inline comments as done.
jyknight added a comment.
Herald added a subscriber: pengfei.

I've finally got back to moving this patch forward -- PTAL, thanks!

To start with, I wrote a simple test-suite to verify the functionality of these changes. I've included the tests under mmx-tests/ in this version of the patch -- although that directory doesn't actually belong there, and I'm not sure where it _does_ belong.

The test-suite runs a number of input combinations against two different compilers' implementations of these intrinsics and checks that they produce identical results. I used this to ensure that there are no behavior changes between clang before and after this change, and also to compare clang against GCC. With that in place, I've fixed and verified all the bugs you noticed in code review already, as well as additional bugs the test-suite found (in _mm_maddubs_pi16 and _mm_shuffle_pi8). I'm now reasonably confident that this change will not alter the behavior of these functions. The tests also uncovered two bugs in GCC: https://gcc.gnu.org/PR98495 and https://gcc.gnu.org/PR98522.

Some other changes in this update:

I switched _mm_extract_pi16 and _mm_insert_pi16 back to using a clang intrinsic, for consistency with the other extract/insert macros, which use an intrinsic function simply to force the element number to be a compile-time constant and to produce an error when it's not. However, the intrinsic now lowers to generic IR like all the other `__builtin_ia32_vec_{ext,set}_*` builtins, rather than to an LLVM intrinsic that forces MMX. I also modified the "composite" functions in xmmintrin.h to use 128-bit operations directly, instead of composing multiple 64-bit operations, where possible.
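For reference, the macro-level shape of that extract/insert change would be roughly as follows. This is only a sketch: the builtin names follow the existing `__builtin_ia32_vec_{ext,set}_*` naming pattern and are my assumption here, not quoted from the patch.

```
/* Sketch: the builtin requires the element index to be a compile-time
   constant, and lowers to generic extractelement/insertelement IR. */
#define _mm_extract_pi16(a, n) \
  ((int)(unsigned short)__builtin_ia32_vec_ext_v4hi((__v4hi)(__m64)(a), (int)(n)))

#define _mm_insert_pi16(a, d, n) \
  ((__m64)__builtin_ia32_vec_set_v4hi((__v4hi)(__m64)(a), (int)(d), (int)(n)))
```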

Finally, the clang tests have been updated, so that all tests pass again.



================
Comment at: clang/lib/Headers/mmintrin.h:367
 {
-    return (__m64)__builtin_ia32_paddb((__v8qi)__m1, (__v8qi)__m2);
+    return (__m64)(((__v8qi)__m1) + ((__v8qi)__m2));
 }
----------------
craig.topper wrote:
> I think we should use __v8qu to match what we do in emmintrin.h. We don't currently set nsw on signed vector arithmetic, but we should be careful in case that changes in the future.
Done, here and everywhere else I was using signed math (except the comparisons).
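For concreteness, the resulting definition for a function like _mm_add_pi8 would look along these lines -- a sketch assuming the header's usual __DEFAULT_FN_ATTRS convention and a __v8qu (8 x unsigned char) typedef:

```
static __inline__ __m64 __DEFAULT_FN_ATTRS
_mm_add_pi8(__m64 __m1, __m64 __m2)
{
    /* Unsigned element type, so no future nsw/signed-overflow assumptions
       can apply to this arithmetic. */
    return (__m64)(((__v8qu)__m1) + ((__v8qu)__m2));
}
```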


================
Comment at: clang/lib/Headers/mmintrin.h:1097
 {
-    return __builtin_ia32_pand((__v1di)__m1, (__v1di)__m2);
+    return (__m64)(((__v1di)__m1) & ((__v1di)__m2));
 }
----------------
craig.topper wrote:
> I think we probably want to use a v2su or v2si here. Using v1di scalarizes and splits on 32-bit targets. On 64-bit targets it emits GPR code.
AFAICT, this doesn't matter? It seems to emit GPR or XMM code depending only on whether the result values are needed as XMM or not, independent of whether the type is specified as v2su or v1du.
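For comparison, the __v2su form suggested above would look like the sketch below (assuming a __v2su typedef for 2 x unsigned int; as noted, in my tests this didn't change the generated code):

```
static __inline__ __m64 __DEFAULT_FN_ATTRS
_mm_and_si64(__m64 __m1, __m64 __m2)
{
    /* Same bitwise AND, expressed on 2 x 32-bit lanes instead of 1 x 64-bit. */
    return (__m64)(((__v2su)__m1) & ((__v2su)__m2));
}
```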



================
Comment at: clang/lib/Headers/mmintrin.h:1242
 {
-    return (__m64)__builtin_ia32_pcmpgtb((__v8qi)__m1, (__v8qi)__m2);
+    return (__m64)((__v8qi)__m1 > (__v8qi)__m2);
 }
----------------
craig.topper wrote:
> Need to use __v8qs here to force "signed char" elements. __v8qi uses "char", which has platform-dependent signedness and can be changed with a command-line flag.
Done.
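The fixed version reads roughly as follows -- a sketch assuming a __v8qs typedef (8 x signed char), analogous to the __v16qs type used in emmintrin.h:

```
static __inline__ __m64 __DEFAULT_FN_ATTRS
_mm_cmpgt_pi8(__m64 __m1, __m64 __m2)
{
    /* __v8qs pins the element type to signed char, independent of the
       target's default char signedness or -funsigned-char. */
    return (__m64)((__v8qs)__m1 > (__v8qs)__m2);
}
```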


================
Comment at: clang/lib/Headers/mmintrin.h:1264
 {
-    return (__m64)__builtin_ia32_pcmpgtw((__v4hi)__m1, (__v4hi)__m2);
+    return (__m64)((__v4hi)__m1 > (__v4hi)__m2);
 }
----------------
craig.topper wrote:
> Same here
This is a short, which is always signed, so it should be ok as written.


================
Comment at: clang/lib/Headers/mmintrin.h:1394
 {
-    return _mm_set_pi32(__i, __i);
+    return __extension__ (__m64)(__v2si){__i, __i};
 }
----------------
craig.topper wrote:
> Is this needed?
No, reverted this change and the others like it.


================
Comment at: clang/lib/Headers/mmintrin.h:1452
 {
-    return _mm_set_pi32(__i1, __i0);
+    return __extension__ (__m64)(__v2si){__i1, __i0};
 }
----------------
craig.topper wrote:
> I don't think this change is needed. And I think the operands are in the wrong order.
The change was unnecessary, so I reverted it. (But the operands are supposed to be swapped here.)
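To spell out why the swap is intentional: _mm_setr_pi32 takes the low element first, while _mm_set_pi32 takes it last, so forwarding one to the other reverses the arguments. Roughly:

```
static __inline__ __m64 __DEFAULT_FN_ATTRS
_mm_setr_pi32(int __i0, int __i1)
{
    /* __i0 goes into the low 32 bits, __i1 into the high 32 bits. */
    return _mm_set_pi32(__i1, __i0);
}
```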


================
Comment at: clang/lib/Headers/tmmintrin.h:17
 #define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("ssse3"), __min_vector_width__(64)))
-#define __DEFAULT_FN_ATTRS_MMX __attribute__((__always_inline__, __nodebug__, __target__("mmx,ssse3"), __min_vector_width__(64)))
+#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
+#define __anyext128(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, -1)
----------------
craig.topper wrote:
> I'm worried that using v1di with the shuffles will lead to scalarization in the type legalizer. Should we use v2si instead?
Converting `__trunc64` to take __v4si (and thus produce a __v2si result) seems to make codegen _worse_ in some cases, and I don't see any case where it gets better.

For example,
```
#define __trunc64_1(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
#define __trunc64_2(x) (__m64)__builtin_shufflevector((__v4si)(x), __extension__ (__v4si){}, 0, 1)
__m64 trunc1(__m128 a, int i) { return __trunc64_1(__builtin_ia32_psllqi128((__v2di)a, i)); }
__m64 trunc2(__m128 a, int i) { return __trunc64_2(__builtin_ia32_psllqi128((__v2di)a, i)); }
```

In trunc2, you get two extraneous moves at the end:
```
        movd    %edi, %xmm1
        psllq   %xmm1, %xmm0
        movq    %xmm0, %rax // extra
        movq    %rax, %xmm0 // extra
```

I guess that's related to calling-convention lowering, which turns __m64 into "double" and confuses the various IR simplifications?

Similarly, there are sometimes also extraneous moves to/from a GPR for argument passing, but I don't see an easy way around that. Both variants do it here, instead of just emitting `movq    %xmm0, %xmm0`:

```
#define __anyext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, -1)
#define __anyext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
#define __zext128_1(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, 1)
#define __zext128_2(x) (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, 2, 3)

__m128 ext1(__m64 a) { return (__m128)__builtin_convertvector((__v4si)__zext128_1(a), __v4sf); }
__m128 ext2(__m64 a) { return (__m128)__builtin_convertvector((__v4si)__zext128_2(a), __v4sf); }
```

Both produce:
```
        movq    %xmm0, %rax
        movq    %rax, %xmm0
        cvtdq2ps        %xmm0, %xmm0
        retq
```

However, switching to variant 2 of `anyext128` and `zext128` does seem to improve things in other cases, avoiding _some_ of those sorts of extraneous moves to a scalar register and back again. So I've made that change.
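Putting that together, my summary of the helper macros after this update looks roughly like this (not a verbatim quote of the patch; __trunc64 keeps the __v2di form, while __anyext128 and __zext128 use the 32-bit-lane "variant 2" shuffles):

```
#define __trunc64(x) \
  (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
#define __anyext128(x) \
  (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, -1, -1)
#define __zext128(x) \
  (__m128i)__builtin_shufflevector((__v2si)(x), __extension__ (__v2si){}, 0, 1, 2, 3)
```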



================
Comment at: clang/lib/Headers/xmmintrin.h:2316
 {
-  return __builtin_ia32_pmovmskb((__v8qi)__a);
+  return __builtin_ia32_pmovmskb128((__v16qi)__anyext128(__a));
 }
----------------
craig.topper wrote:
> This doesn't guarantee zeroes in bits 15:8 does it?
It does not. Switched to zext128.
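So the fixed form is roughly the following sketch, per the discussion above:

```
static __inline__ int __DEFAULT_FN_ATTRS
_mm_movemask_pi8(__m64 __a)
{
    /* Zero-extend to 128 bits so the upper eight byte lanes are zero,
       guaranteeing bits 15:8 of the pmovmskb result are zero. */
    return __builtin_ia32_pmovmskb128((__v16qi)__zext128(__a));
}
```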


================
Comment at: clang/lib/Headers/xmmintrin.h:2404
+  __m128i __n128  = __zext128(__n);
+  if (((__SIZE_TYPE__)__p & 0xfff) > 4096-15 && ((__SIZE_TYPE__)__p & 0xfff) < 4096-8) {
+    __p -= 8;
----------------
craig.topper wrote:
> Does this work with large pages?
Yes -- the check only needs to be conservative about any boundary at which a trap _might_ occur if we crossed it. Whether that boundary is in fact the end of a page is irrelevant; all that matters is that it _could_ be.
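To spell out the scheme: if the 16-byte maskmovdqu store starting at __p could cross a 4KiB boundary that the original 8-byte maskmovq would not have crossed, back the pointer up by 8 bytes and shift the data and mask up by 8 bytes to compensate. A rough sketch of that shape follows; the exact bounds and the byte-shift builtin spelling here are my assumptions, not a quote of the patch:

```
static __inline__ void __DEFAULT_FN_ATTRS
_mm_maskmove_si64(__m64 __d, __m64 __n, char *__p)
{
  __m128i __d128 = __anyext128(__d);
  __m128i __n128 = __zext128(__n);
  /* Offsets where a 16-byte store would cross a 4KiB boundary, but the
     8-byte maskmovq at the original address would not have. */
  if (((__SIZE_TYPE__)__p & 0xfff) >= 4096 - 15 &&
      ((__SIZE_TYPE__)__p & 0xfff) <= 4096 - 8) {
    __p -= 8;
    __d128 = (__m128i)__builtin_ia32_pslldqi128_byteshift((__v2di)__d128, 8);
    __n128 = (__m128i)__builtin_ia32_pslldqi128_byteshift((__v2di)__n128, 8);
  }
  __builtin_ia32_maskmovdqu((__v16qi)__d128, (__v16qi)__n128, __p);
}
```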


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86855/new/

https://reviews.llvm.org/D86855


