[PATCH] D106280: [X86][AVX] scalar_to_vector(load_scalar()) -> load_vector() for fast dereferencable loads
Roman Lebedev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Jul 20 04:11:59 PDT 2021
lebedev.ri added a comment.
Please note that this patch is very partial.
The actual assembly diff should be as follows:
https://godbolt.org/z/W47nvzc4e
I think it is clear from it that the wide load is unquestionably better.
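To illustrate the pattern under review, here is a hypothetical IR sketch (the function name and values are illustrative, not taken from the test file): a scalar load feeding a `scalar_to_vector`, where the pointer is known to be dereferenceable for a full vector width.

```llvm
; Hypothetical sketch: %p is known safe to read for 16 bytes.
define <4 x float> @widen(float* dereferenceable(16) %p) {
  %s = load float, float* %p
  %v = insertelement <4 x float> undef, float %s, i32 0
  ret <4 x float> %v
}
```

Under the `dereferenceable(16)` guarantee, reading the extra 12 bytes is known to be safe, so the backend may emit a single 16-byte `vmovups` instead of a `vmovss`, at the cost of the high lanes holding whatever is in memory rather than zeros.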
================
Comment at: llvm/test/CodeGen/X86/load-partial-dot-product.ll:183
; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
-; AVX-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero
+; AVX-NEXT: vmovups (%rsi), %xmm1
; AVX-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],mem[0],xmm1[3]
----------------
pengfei wrote:
> RKSimon wrote:
> > efriedma wrote:
> > > Even if we're allowed to do this, it doesn't seem wise; having zero in the high bits of the register is better than random junk. Can we mark up the loads somehow?
> > Isn't that what the dereferenceable(16) tag is doing?
> I have the same doubt. `dereferenceable(16)` tells us the memory for the high bits is accessible. But shouldn't we always prefer loading fewer bytes for performance?
You are comparing apples to oranges here.
The problem is that the `vinsertps` is (obviously) redundant and should go away.
Once it does, the wide load is obviously better: we have one less memory access.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D106280/new/
https://reviews.llvm.org/D106280