[PATCH] D86855: Convert __m64 intrinsics to unconditionally use SSE2 instead of MMX instructions.

Sun Aug 30 22:44:46 PDT 2020

craig.topper added inline comments.

================
Comment at: clang/lib/Headers/mmintrin.h:367
 {
-    return (__m64)__builtin_ia32_paddb((__v8qi)__m1, (__v8qi)__m2);
+    return (__m64)(((__v8qi)__m1) + ((__v8qi)__m2));
 }
----------------
I think you we should use __v8qu to match what we do in emmintrin.h. We don't currently set nsw on signed vector arithmetic, but we should be careful in case that changes in the future.

================
Comment at: clang/lib/Headers/mmintrin.h:1097
 {
-    return __builtin_ia32_pand((__v1di)__m1, (__v1di)__m2);
+    return (__m64)(((__v1di)__m1) & ((__v1di)__m2));
 }
----------------
I think we probably want to use a v2su or v2si here. Using v1di scalarizes and splits on 32-bit targets. On 64-bit targets it emits GPR code.

================
Comment at: clang/lib/Headers/mmintrin.h:1242
 {
-    return (__m64)__builtin_ia32_pcmpgtb((__v8qi)__m1, (__v8qi)__m2);
+    return (__m64)((__v8qi)__m1 > (__v8qi)__m2);
 }
----------------
Need to use __v8qs here to force "signed char" elements. __v8qi uses "char" which has platform dependent signedness or can be changed with a command line.

================
Comment at: clang/lib/Headers/mmintrin.h:1264
 {
-    return (__m64)__builtin_ia32_pcmpgtw((__v4hi)__m1, (__v4hi)__m2);
+    return (__m64)((__v4hi)__m1 > (__v4hi)__m2);
 }
----------------
Same here

================
Comment at: clang/lib/Headers/mmintrin.h:1394
 {
-    return _mm_set_pi32(__i, __i);
+    return __extension__ (__m64)(__v2si){__i, __i};
 }
----------------
Is this needed?

================
Comment at: clang/lib/Headers/mmintrin.h:1452
 {
-    return _mm_set_pi32(__i1, __i0);
+    return __extension__ (__m64)(__v2si){__i1, __i0};
 }
----------------
I don't think this change is needed. And I think the operands are in the wrong order.

================
Comment at: clang/lib/Headers/tmmintrin.h:17
 #define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("ssse3"), __min_vector_width__(64)))
-#define __DEFAULT_FN_ATTRS_MMX __attribute__((__always_inline__, __nodebug__, __target__("mmx,ssse3"), __min_vector_width__(64)))
+#define __trunc64(x) (__m64)__builtin_shufflevector((__v2di)(x), __extension__ (__v2di){}, 0)
+#define __anyext128(x) (__m128i)__builtin_shufflevector((__v1di)(x), __extension__ (__v1di){}, 0, -1)
----------------
I'm worried that using v1di with the shuffles will lead to scalarization in the type legalizer. Should we use v2si instead?

================
Comment at: clang/lib/Headers/xmmintrin.h:2316
 {
-  return __builtin_ia32_pmovmskb((__v8qi)__a);
+  return __builtin_ia32_pmovmskb128((__v16qi)__anyext128(__a));
 }
----------------
This doesn't guarantee zeroes in bits 15:8 does it?

================
Comment at: clang/lib/Headers/xmmintrin.h:2404
+  __m128i __n128  = __zext128(__n);
+  if (((__SIZE_TYPE__)__p & 0xfff) > 4096-15 && ((__SIZE_TYPE__)__p & 0xfff) < 4096-8) {
+    __p -= 8;
----------------
Does this work with large pages?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86855/new/

https://reviews.llvm.org/D86855