[PATCH] [X86, AVX] use blends instead of insert128 with index 0

Sanjay Patel spatel at rotateright.com
Wed Mar 18 14:17:45 PDT 2015

Comment at: test/CodeGen/X86/avx-cast.ll:43-46
@@ -24,1 +42,6 @@
 define <4 x i64> @castC(<2 x i64> %m) nounwind uwtable readnone ssp {
+; AVX1-LABEL: castC:
+; AVX1:         vxorps %xmm1, %xmm1, %xmm1
+; AVX1-NEXT:    vinsertf128 $0, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT:    retq
andreadb wrote:
> So, the reason why your code doesn't optimize this case for AVX1 is because AVX1 doesn't support integer blend on YMM registers.
> However, wouldn't the following code be faster in this case (at least on Intel cpus)?
>   vxorps %ymm1, %ymm1, %ymm1
>   vblendps $0, %ymm0, %ymm1, %ymm0
> I understand that we want to avoid domain-crossing as much as possible.
> However, in this particular case I don't think it is possible (please correct me if I am wrong).
> Your code would fall back to selecting a 'vinsertf128'. However, as far as I know 'vinsertf128' is floating point cluster anyway. So, I expect (I haven't tested it though) that using 'vblendps/d' would probably give us the same (or better on Haswell?) throughput. What do you think?
Hi Andrea -

Thanks for the close reading!

Yes - if you only have AVX1 and you get to this point, then there's no avoiding the domain crossing, because vinserti128 (an AVX2 instruction) isn't available either.

I'll redo the check to account for this case.
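
For illustration only, the decision under discussion could be sketched roughly as below. This is not the code from the patch: the names Features and selectLowInsert are hypothetical, and the sketch ignores the f32/f64 distinction (vblendps vs. vblendpd) for brevity.

```cpp
#include <string>

// Hypothetical stand-in for the subtarget query; not an LLVM type.
struct Features {
  bool hasAVX2;
};

// Pick a mnemonic for placing a 128-bit value in the low half of a
// 256-bit vector (the "insert128 with index 0" case).
std::string selectLowInsert(bool isIntegerElt, const Features &f) {
  if (!isIntegerElt)
    return "vblendps"; // FP data: the blend stays in the FP domain.
  if (f.hasAVX2)
    return "vpblendd"; // AVX2 provides a 256-bit integer blend.
  // AVX1 only: there is no 256-bit integer blend and no vinserti128,
  // so the domain crossing cannot be avoided; use the FP blend
  // instead of falling back to vinsertf128.
  return "vblendps";
}
```

Under these assumptions, the AVX1 integer case (castC above) would select a blend rather than vinsertf128.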


