[PATCH] D37446: [x86] eliminate unnecessary vector compare for AVX masked store

Sanjay Patel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 5 10:37:20 PDT 2017


spatel added inline comments.


================
Comment at: test/CodeGen/X86/masked_memop.ll:1158
 ; SKX-LABEL: trunc_mask:
 ; SKX:       ## BB#0:
 ; SKX-NEXT:    vpxor %xmm1, %xmm1, %xmm1
----------------
aymanmus wrote:
> spatel wrote:
> > aymanmus wrote:
> > > I think the optimal code for SKX is:
> > > vpmovd2m %xmm2, %k1
> > > vmovups %xmm0, (%rdi) {%k1}
> > > 
> > Ok - let me try to shake that out of here. To be clear, we're saying this is the optimal sequence for any CPU with avx512vl/avx512bw. SKX is just an implementation of those ISAs.
> - The IACA tool shows the same throughput for both sequences, but the one I suggested has one less uop and uses one less register.
> - Actually, the features needed for vpmovb2m/vpmovw2m are avx512vl+avx512bw, and for vpmovd2m/vpmovq2m they are avx512vl+avx512dq (which SKX also includes).
> - The test's %y operand is not used.
I need to confirm what we're saying here. For a 128-bit vector (and similarly for 256-bit), if the machine has avx512 (with all necessary variants), then we would rather see this:
  vpmovd2m %xmm2, %k1
  vmovups %xmm0, (%rdi) {%k1}

than the single instruction that we would produce for a plain AVX machine:
  vmaskmovps %xmm0, %xmm2, (%rdi)

I.e., we want to treat vmaskmovps as legacy cruft and avoid it if we have bitmasks?
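
For reference, here is a rough sketch of the kind of IR this test exercises (the exact body and signature of trunc_mask may differ; the unused %y operand is included only to match the comment above). The pattern is a signed compare against zero feeding the masked store intrinsic:
  define void @trunc_mask(<4 x float> %x, <4 x float>* %ptr, <4 x float> %y, <4 x i32> %mask) {
    ; store the lanes of %x whose corresponding %mask element is negative
    %bool_mask = icmp slt <4 x i32> %mask, zeroinitializer
    call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %x, <4 x float>* %ptr, i32 1, <4 x i1> %bool_mask)
    ret void
  }
  declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32, <4 x i1>)

On a plain AVX target the sign bits of %mask feed vmaskmovps directly, while with avx512vl+avx512dq the compare could fold into vpmovd2m and a k-register masked vmovups as shown above.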


https://reviews.llvm.org/D37446
