[PATCH] D32416: [x86, SSE] AVX1 PR28129

Tue Apr 25 08:07:57 PDT 2017

RKSimon added inline comments.

================
Comment at: test/CodeGen/X86/vector-pcmp.ll:156-158
+; AVX1-NEXT:    vxorps %ymm1, %ymm1, %ymm1
+; AVX1-NEXT:    vcmptrueps %ymm1, %ymm1, %ymm1
 ; AVX1-NEXT:    vxorps %ymm1, %ymm0, %ymm0
----------------
spatel wrote:
> RKSimon wrote:
> > spatel wrote:
> > > That's an interesting case...that we probably can't answer at the DAG level. Would it be better to use two 128-bit vpxor instructions instead of incurring a potential domain-crossing penalty with the one 256-bit vxorps?
> > Do you mean this? 
> > ```
> > vextractf128 $1, %ymm0, %xmm1
> > vpxor %xmm2, %xmm2, %xmm2
> > vpcmpgtb %xmm1, %xmm2, %xmm1
> > vpcmpgtb %xmm0, %xmm2, %xmm0
> > vpcmpeqd %xmm2, %xmm2, %xmm2
> > vpxor %xmm2, %xmm1, %xmm1
> > vpxor %xmm2, %xmm0, %xmm0
> > vinsertf128 $1, %xmm1, %ymm0, %ymm0
> > ```
> Yes - I remember reading somewhere (and not sure how widely this applies) that the 'insertX128' insts may not actually have domain-crossing penalties. The other variable in this mix (thinking about Jaguar here) is that the 256-bit ops may be cracked and double-pumped anyway, so if we have that + domain-crossing penalty, then the two 128-bit insts should be faster?
A quick hot loop test suggests that the old vpcmpeqd+vinsertf128+xor approach takes 8cy, the 256-bit xor+vcmptrueps+xor approach takes 7cy and the 128-bit vpcmpeqd+2*xor takes 6cy on Jaguar.

It might be worth looking at splitting some 256-bit bitwise operations whose operands are concatenated 128-bit values, but I don't think it should get in the way of this patch.
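
For reference, the equivalence that such a split would rely on can be modelled outside the compiler. This is only a sketch in Python using plain integers as stand-ins for vector registers; the helper names (`concat`, `not_256`, `not_split`) are made up for illustration and are not LLVM or intrinsic APIs. It says nothing about cycle counts, only that the 256-bit xor-with-all-ones and the two 128-bit xors followed by a concatenation produce the same bits.

```python
# Integers stand in for vector registers; all names here are hypothetical.
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


def concat(lo, hi):
    """Model a vinsertf128-style concatenation: hi lands in bits 255:128."""
    return ((hi & MASK128) << 128) | (lo & MASK128)


def not_256(v):
    """256-bit form: one xor of the full value against an all-ones value."""
    return (v & MASK256) ^ MASK256


def not_split(lo, hi):
    """128-bit form: xor each half against all-ones, then concatenate."""
    return concat((lo & MASK128) ^ MASK128, (hi & MASK128) ^ MASK128)


if __name__ == "__main__":
    lo = 0x00FF00FF00FF00FF00FF00FF00FF00FF
    hi = 0x0123456789ABCDEF0123456789ABCDEF
    # Both forms must agree bit for bit on the concatenated input.
    print(not_256(concat(lo, hi)) == not_split(lo, hi))
```

Because xor is purely lane-wise, the same argument holds for and/or/andn, which is why bitwise ops look like the natural candidates for this kind of split.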

https://reviews.llvm.org/D32416