[llvm-bugs] [Bug 40759] New: [X86] Compiler wrongly emits temporal stores when vectorizing a scalar nontemporal memcpy loop.
via llvm-bugs
llvm-bugs at lists.llvm.org
Mon Feb 18 04:04:14 PST 2019
https://bugs.llvm.org/show_bug.cgi?id=40759
Bug ID: 40759
Summary: [X86] Compiler wrongly emits temporal stores when
vectorizing a scalar nontemporal memcpy loop.
Product: libraries
Version: trunk
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: andrea.dibiagio at gmail.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, spatel+llvm at rotateright.com
Example:
```
void foo(unsigned *A, unsigned *B, unsigned Elts) {
  for (unsigned I = 0; I < Elts; ++I) {
    unsigned X = A[I];
    __builtin_nontemporal_store(X, &B[I]);
  }
}
```
> clang -O2 -march=btver2 -S -o -
```
.LBB0_8: # %for.body
movl (%rdi,%rcx,4), %edx
movntil %edx, (%rsi,%rcx,4) # <<== OK. Nontemporal store.
incq %rcx
cmpq %rcx, %rax
jne .LBB0_8
retq
.LBB0_5: # %vector.ph
movl %eax, %ecx
xorl %edx, %edx
andl $-32, %ecx
.p2align 4, 0x90
.LBB0_6: # %vector.body
vmovups (%rdi,%rdx,4), %ymm0
vmovups 32(%rdi,%rdx,4), %ymm1
vmovups 64(%rdi,%rdx,4), %ymm2
vmovups 96(%rdi,%rdx,4), %ymm3
vmovups %ymm0, (%rsi,%rdx,4) # <<== WRONG. Temporal vector store.
vmovups %ymm1, 32(%rsi,%rdx,4) # Same...
vmovups %ymm2, 64(%rsi,%rdx,4) # Same...
vmovups %ymm3, 96(%rsi,%rdx,4) # Same...
addq $32, %rdx
cmpq %rdx, %rcx
jne .LBB0_6
# %bb.7: # %middle.block
cmpq %rax, %rcx
jne .LBB0_8
```
On X86, (V)MOVNTPS can be used to do non-temporal vector stores.
However, VMOVNTPS requires the destination memory operand to be aligned to
16 bytes (for the 128-bit stores) or 32 bytes (for the 256-bit stores).
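The alignment rule above can be sketched in plain C. This is only an illustration: `nt_store_aligned` and `nt_peel_elts` are hypothetical helper names, not LLVM or clang APIs, and the code assumes 4-byte `unsigned` elements and a power-of-two vector width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if `p` meets the alignment needed by a
 * non-temporal vector store of `vec_bytes` bytes (16 for 128-bit stores,
 * 32 for 256-bit stores). `vec_bytes` must be a power of two. */
static int nt_store_aligned(const void *p, size_t vec_bytes) {
    return ((uintptr_t)p & (vec_bytes - 1)) == 0;
}

/* Hypothetical helper: number of leading `unsigned` elements that must be
 * stored with scalar code before `p` reaches `vec_bytes` alignment.
 * Returns SIZE_MAX if `p` is not even element-aligned, in which case no
 * number of whole elements can ever make it vector-aligned. */
static size_t nt_peel_elts(const void *p, size_t vec_bytes) {
    size_t mis = (size_t)((uintptr_t)p & (vec_bytes - 1));
    if (mis == 0)
        return 0;
    if (mis % sizeof(unsigned) != 0)
        return SIZE_MAX;
    return (vec_bytes - mis) / sizeof(unsigned);
}
```

A pointer that is only known to be 4-byte aligned can fail the first check, which is why the alignment has to be tested or established at run time before a (V)MOVNTPS-style store is legal.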
In this example, store instructions are marked as 4-byte aligned.
When the loop vectorizer kicks in, it generates a vector loop body, and all
vector stores are correctly annotated with metadata flag "!nontemporal" and
alignment 4.
However, on x86 there is no support for unaligned nontemporal vector stores.
So, ISel falls back to selecting normal (i.e. "temporal") unaligned stores (see
the VMOVUPS from the assembly above).
When vectorizing a memcpy-like loop, we should probably check if the target has
support for unaligned nontemporal vector stores before transforming the loop.
Otherwise, we risk accidentally introducing temporal stores that pollute the
caches.
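Until the vectorizer handles this case, a source-level workaround is possible. The following is a minimal sketch, not the fix proposed in this report: it peels scalar iterations until the destination is 16-byte aligned and only then issues 128-bit non-temporal stores. The function name `copy_nt` is hypothetical, the code assumes 4-byte `unsigned`, and it uses SSE2 intrinsics where available with a plain (temporal) copy as the fallback on other targets.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical workaround: copy Elts elements from A to B while keeping
 * every store non-temporal on SSE2 targets. */
void copy_nt(const unsigned *A, unsigned *B, unsigned Elts) {
    unsigned I = 0;
#if defined(__SSE2__)
    /* Scalar prologue: MOVNTI stores until &B[I] is 16-byte aligned. */
    while (I < Elts && ((uintptr_t)&B[I] & 15) != 0) {
        _mm_stream_si32((int *)&B[I], (int)A[I]);
        ++I;
    }
    /* Vector body: unaligned load, aligned non-temporal store. */
    for (; Elts - I >= 4; I += 4) {
        __m128i V = _mm_loadu_si128((const __m128i *)&A[I]);
        _mm_stream_si128((__m128i *)&B[I], V);
    }
    /* Scalar epilogue for the remaining 0..3 elements. */
    for (; I < Elts; ++I)
        _mm_stream_si32((int *)&B[I], (int)A[I]);
    /* Non-temporal stores are weakly ordered; fence before returning. */
    _mm_sfence();
#else
    /* Fallback for non-SSE2 targets: ordinary (temporal) stores. */
    for (; I < Elts; ++I)
        B[I] = A[I];
#endif
}
```

The same peeling idea would apply to 256-bit stores with a 32-byte alignment check; whether a compiler-side fix should peel, version the loop, or simply decline to vectorize is left open here.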