[PATCH] D32416: [x86, SSE] AVX1 PR28129

Tue Apr 25 07:34:10 PDT 2017

spatel added inline comments.

================
Comment at: test/CodeGen/X86/vector-pcmp.ll:156-158
+; AVX1-NEXT:    vxorps %ymm1, %ymm1, %ymm1
+; AVX1-NEXT:    vcmptrueps %ymm1, %ymm1, %ymm1
 ; AVX1-NEXT:    vxorps %ymm1, %ymm0, %ymm0
----------------
RKSimon wrote:
> spatel wrote:
> > That's an interesting case...that we probably can't answer at the DAG level. Would it be better to use two 128-bit vpxor instructions instead of incurring a potential domain-crossing penalty with the one 256-bit vxorps?
> Do you mean this? 
> ```
> vextractf128 $1, %ymm0, %xmm1
> vpxor %xmm2, %xmm2, %xmm2
> vpcmpgtb %xmm1, %xmm2, %xmm1
> vpcmpgtb %xmm0, %xmm2, %xmm0
> vpcmpeqd %xmm2, %xmm2, %xmm2
> vpxor %xmm2, %xmm1, %xmm1
> vpxor %xmm2, %xmm0, %xmm0
> vinsertf128 $1, %xmm1, %ymm0, %ymm0
> ```
Yes - I remember reading somewhere (I'm not sure how widely this applies) that the 'insertX128' instructions may not actually incur domain-crossing penalties. The other variable in this mix (thinking about Jaguar here) is that 256-bit ops may be cracked and double-pumped anyway, so if we pay for that plus a domain-crossing penalty, then the two 128-bit instructions should be faster?
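
For reference, here is a rough C sketch of what the split 128-bit sequence quoted above computes: for each of the 32 bytes, the inverse of the signed compare `0 > x`. It only illustrates the semantics and says nothing about latency or domain crossing; which xor encoding the compiler picks is up to it. The function names `not_sign_mask_2x128` and `check_not_sign_mask` are made up for this example, and the scalar fallback is an assumption so the sketch also builds on targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: per-byte NOT(0 > x) over 32 bytes,
 * processed as two independent 128-bit halves. */
static void not_sign_mask_2x128(const int8_t in[32], uint8_t out[32]) {
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    /* All-ones constant via the compare-equal idiom. */
    const __m128i ones = _mm_cmpeq_epi32(zero, zero);
    for (int half = 0; half < 2; ++half) {
        __m128i x  = _mm_loadu_si128((const __m128i *)(in + 16 * half));
        __m128i lt = _mm_cmpgt_epi8(zero, x); /* 0xFF where x < 0 */
        __m128i r  = _mm_xor_si128(lt, ones); /* invert the mask */
        _mm_storeu_si128((__m128i *)(out + 16 * half), r);
    }
#else
    /* Scalar fallback with the same per-byte result. */
    for (int i = 0; i < 32; ++i)
        out[i] = (uint8_t)(in[i] < 0 ? 0x00 : 0xFF);
#endif
}

/* Small self-check: negative bytes map to 0x00, the rest to 0xFF. */
static void check_not_sign_mask(void) {
    int8_t in[32];
    uint8_t out[32];
    for (int i = 0; i < 32; ++i)
        in[i] = (int8_t)(i - 16); /* covers -16..15 */
    not_sign_mask_2x128(in, out);
    for (int i = 0; i < 32; ++i)
        assert(out[i] == (in[i] < 0 ? 0x00 : 0xFF));
}
```

Calling `check_not_sign_mask()` exercises both halves with a mix of negative and non-negative bytes.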

https://reviews.llvm.org/D32416