[PATCH] D37446: [x86] eliminate unnecessary vector compare for AVX masked store

Ayman Musa via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 6 00:42:27 PDT 2017


aymanmus added inline comments.


================
Comment at: test/CodeGen/X86/masked_memop.ll:1158
 ; SKX-LABEL: trunc_mask:
 ; SKX:       ## BB#0:
 ; SKX-NEXT:    vpxor %xmm1, %xmm1, %xmm1
----------------
spatel wrote:
> aymanmus wrote:
> > spatel wrote:
> > > aymanmus wrote:
> > > > I think the optimal code for SKX is:
> > > > vpmovd2m %xmm2, %k1
> > > > vmovups %xmm0, (%rdi) {%k1}
> > > > 
> > > Ok - let me try to shake that out of here. To be clear, we're saying this is the optimal sequence for any CPU with avx512vl/avx512bw. SKX is just an implementation of those ISAs.
> > - The IACA tool shows the same throughput for both sequences, but the one I suggested uses one less uop and one less register.
> > - Actually, the needed features for vpmovb2m/vpmovw2m are avx512vl+avx512bw, and for vpmovd2m/vpmovq2m they are avx512vl+avx512dq (which SKX also includes).
> > - The test's %y operand is not used.
> I need to confirm what we're saying here. For a 128-bit vector (and similarly for 256-bit), if the machine has avx512 (with all necessary variants), then we would rather see this:
>   vpmovd2m %xmm2, %k1
>   vmovups %xmm0, (%rdi) {%k1}
> 
> than the single instruction that we would produce for a plain AVX machine:
>   vmaskmovps %xmm0, %xmm2, (%rdi)
> 
> I.e., we want to treat vmaskmovps as legacy cruft and avoid it if we have bitmasks?
Actually, it seems the two sequences are equivalent on SKX: they show the same throughput and number of uops, use the same ports, and have equal latency on each port.

Nonetheless, I think we should prefer the vpmovd2m alternative because it provides a full set of instructions for all type granularities (byte, word, double-word, and quad-word), while the AVX vmaskmov instructions are only available for 32-bit and 64-bit elements.
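
For context, the pattern in question is a masked store whose mask comes from the sign bits of an integer vector. A minimal IR sketch follows (the function and operand names mirror the trunc_mask test referenced above and are otherwise assumptions, including the unused %y operand noted earlier):

  ; the sign bit of each %mask element selects whether that lane of %x is stored
  define void @trunc_mask(<4 x float> %x, <4 x float>* %ptr, <4 x float> %y, <4 x i32> %mask) {
    %bool_mask = icmp slt <4 x i32> %mask, zeroinitializer
    call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %x, <4 x float>* %ptr, i32 1, <4 x i1> %bool_mask)
    ret void
  }
  declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32, <4 x i1>)

With a plain AVX target this can lower to the single vmaskmovps shown above; with avx512vl+avx512dq, the preferred lowering discussed here is vpmovd2m %xmm2, %k1 followed by a masked vmovups.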


https://reviews.llvm.org/D37446