[PATCH] D106280: [X86][AVX] scalar_to_vector(load_scalar()) -> load_vector() for fast dereferenceable loads

Roman Lebedev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jul 20 04:11:59 PDT 2021


lebedev.ri added a comment.

Please note that this patch is very partial.
The actual assembly diff should be as follows:
https://godbolt.org/z/W47nvzc4e

I think it is clear from it that the wide load is unquestionably better.



================
Comment at: llvm/test/CodeGen/X86/load-partial-dot-product.ll:183
 ; AVX-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
-; AVX-NEXT:    vmovsd {{.*#+}} xmm1 = mem[0],zero
+; AVX-NEXT:    vmovups (%rsi), %xmm1
 ; AVX-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,1],mem[0],xmm1[3]
----------------
pengfei wrote:
> RKSimon wrote:
> > efriedma wrote:
> > > Even if we're allowed to do this, it doesn't seem wise; having zero in the high bits of the register is better than random junk.  Can we mark up the loads somehow?
> > Isn't that what the dereferenceable(16) tag is doing?
> I have the same doubt. `dereferenceable(16)` tells us that the memory for the high bits is available. But shouldn't we always prefer loading fewer bytes for performance?
You are comparing apples to oranges here.
The problem here is that the `vinsertps` is (obviously) redundant and should go away.
Once it does, the wide load is clearly better: we end up with one less memory access.
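For illustration, a minimal IR sketch of the kind of transform under discussion (function and value names are hypothetical, not from the patch):

```llvm
; Because %p is dereferenceable(16), reading the 12 bytes beyond the
; scalar element cannot fault, so the 4-byte scalar load may legally
; be widened into a single 16-byte vector load.
define <4 x float> @widen(ptr dereferenceable(16) %p) {
  %s = load float, ptr %p, align 4
  %v = insertelement <4 x float> poison, float %s, i64 0
  ret <4 x float> %v
}

; After widening, the load of element 0 can instead be expressed as:
;   %v = load <4 x float>, ptr %p, align 4
; which on AVX lowers to one vmovups instead of vmovsd + inserts.
```

Note this sketch only shows the legality argument from `dereferenceable(16)`; whether the widened load is profitable is exactly what is being debated above.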


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106280/new/

https://reviews.llvm.org/D106280
