[PATCH] [X86, AVX] use blends instead of insert128 with index 0

Wed Mar 18 10:24:44 PDT 2015

Hi Sanjay,


================
Comment at: test/CodeGen/X86/avx-cast.ll:43-46
@@ -24,1 +42,6 @@
 define <4 x i64> @castC(<2 x i64> %m) nounwind uwtable readnone ssp {
+; AVX1-LABEL: castC:
+; AVX1:         vxorps %xmm1, %xmm1, %xmm1
+; AVX1-NEXT:    vinsertf128 $0, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT:    retq
+;
----------------
So, your code doesn't optimize this case for AVX1 because AVX1 doesn't support integer blends on YMM registers.

However, wouldn't the following code be faster in this case (at least on Intel CPUs)?
  vxorps %ymm1, %ymm1, %ymm1
  vblendps $15, %ymm0, %ymm1, %ymm0

I understand that we want to avoid domain-crossing as much as possible.
However, in this particular case I don't think avoiding it is possible (please correct me if I am wrong).
Your code would fall back to selecting a 'vinsertf128'. However, as far as I know, 'vinsertf128' executes in the floating-point domain anyway. So I expect (though I haven't tested it) that using 'vblendps/d' would give the same throughput, or possibly better on Haswell. What do you think?
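
To make the blend suggestion concrete, here is a small Python model of the two instructions. It is only a sketch of the lane semantics, not LLVM code; the function names and the sample values are made up for illustration. It assumes the usual blend rule (bit i of the immediate set means lane i comes from the second source), under which the blend that matches 'vinsertf128 $0' into a zeroed register needs the immediate 0x0F (15), while an immediate of 0 would just return the zero vector.

```python
def vinsertf128(dst_src, xmm, index):
    """Model: overwrite one 128-bit half (4 floats) of an 8-float vector."""
    out = list(dst_src)
    out[4 * index:4 * index + 4] = xmm[:4]
    return out


def vblendps(src1, src2, imm8):
    """Model: bit i of imm8 set -> lane i from src2, otherwise from src1."""
    return [src2[i] if (imm8 >> i) & 1 else src1[i] for i in range(8)]


# ymm1 after vxorps: all zeros.
zero = [0.0] * 8
# ymm0: the <2 x i64> argument viewed as 4 floats in the low half;
# the upper half is undefined, modelled here as arbitrary 9.0 values.
value = [1.0, 2.0, 3.0, 4.0, 9.0, 9.0, 9.0, 9.0]

inserted = vinsertf128(zero, value, 0)   # vinsertf128 $0, %xmm0, %ymm1, %ymm0
blended = vblendps(zero, value, 0x0F)    # vblendps $15, %ymm0, %ymm1, %ymm0
all_from_zero = vblendps(zero, value, 0)  # immediate 0 keeps every lane of src1
```

In this model both sequences leave the low 128 bits equal to the input and the high 128 bits zero, so the choice between them comes down to throughput and execution domain only.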

http://reviews.llvm.org/D8366
