[llvm-bugs] [Bug 47426] New: _mm_load_si128() expands to two movq's.
via llvm-bugs
llvm-bugs at lists.llvm.org
Fri Sep 4 19:18:18 PDT 2020
https://bugs.llvm.org/show_bug.cgi?id=47426
Bug ID: 47426
Summary: _mm_load_si128() expands to two movq's.
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: anmparal at cisco.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, spatel+llvm at rotateright.com
Given the test:
     1  #include <stdint.h>
     2  #include <xmmintrin.h>
     3  #include <x86intrin.h>
     4
     5  uint32_t read_128b(__m128i *ptr)
     6  {
     7      __m128i val = _mm_load_si128(ptr);
     8      return ((uint32_t *) &val)[0] |
     9             ((uint32_t *) &val)[1] |
    10             ((uint32_t *) &val)[2] |
    11             ((uint32_t *) &val)[3];
    12  }
With clang version 12.0.0 (https://github.com/llvm/llvm-project.git, commit
4eef14f9780d9fc9a88096a3cabd669bcfa02bbc, 2020-09-04), the _mm_load_si128()
is translated at '-O2 -msse2' to:

    movq    (%rdi), %rcx
    movq    8(%rdi), %rdx
This is not in accordance with Ref. [0], which specifies:

    Synopsis
        __m128i _mm_load_si128 (__m128i const* mem_addr)
        #include <emmintrin.h>
        Instruction: movdqa xmm, m128
        CPUID Flags: SSE2
(Note: gcc 10.1.0 and icc 16.0.5.027b both generate a movdqa, as expected.)
The accesses at lines 8 through 11 cause the problematic 64-bit loads;
modifying the code (see the '<<<' markers) so that:
     1  #include <stdint.h>
     2  #include <xmmintrin.h>
     3  #include <x86intrin.h>
     4
     5  uint32_t read_128b(__m128i *ptr, uint8_t index)    <<<
     6  {
     7      __m128i val = _mm_load_si128(ptr);
     8      return ((uint32_t *) &val)[index];             <<<
     9  }
- we see that _mm_load_si128() is translated to 'movaps (%rdi), %xmm0', as
expected. (Note: per Ref. [1], movaps and movdqa are interchangeable.)
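As an aside, the reduction can also be reformulated to stay entirely in
vector registers, which likewise preserves the full 128-bit load. (This is
an illustrative sketch, not part of the original test case; read_128b_vec
is a hypothetical name.)

    #include <stdint.h>
    #include <emmintrin.h>

    uint32_t read_128b_vec(__m128i *ptr)
    {
        __m128i val = _mm_load_si128(ptr);                 /* movdqa */
        /* OR lanes {0,2} and {1,3} together. */
        __m128i t   = _mm_shuffle_epi32(val, _MM_SHUFFLE(1, 0, 3, 2));
        val = _mm_or_si128(val, t);
        /* OR the two surviving lanes together. */
        t   = _mm_shuffle_epi32(val, _MM_SHUFFLE(2, 3, 0, 1));
        val = _mm_or_si128(val, t);
        return (uint32_t) _mm_cvtsi128_si32(val);          /* movd */
    }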
The _mm_load_si128() builtin is defined in clang/lib/Headers/emmintrin.h
with the attribute __min_vector_width__(128):

    #define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, \
                                              __target__("sse2"), \
                                              __min_vector_width__(128)))
    ...
    /// ...
    /// This intrinsic corresponds to the <c> VMOVDQA / MOVDQA </c> instruction.
    /// ...
    static __inline__ __m128i __DEFAULT_FN_ATTRS
    _mm_load_si128(__m128i const *__p)
    {
      return *__p;
    }
Per Ref. [2], "This attribute may be attached to a function and informs the
backend that this function desires vectors of at least this width to be
generated. ... This attribute is meant to be a hint to control target
heuristics that may generate narrower vectors than what the target hardware
supports." So, it is reasonable to expect that the vector pointed to by '__p'
is always treated in its 128-bit entirety.
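For completeness, a minimal standalone sketch of how the attribute attaches
to a user-level function, mirroring the header macro above (a hypothetical
example, not taken from the report):

    #include <emmintrin.h>

    __attribute__((__always_inline__, __target__("sse2"),
                   __min_vector_width__(128)))
    static __inline__ __m128i load_128b(__m128i const *p)
    {
        return *p;  /* expected to remain a single 128-bit load */
    }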
_mm_load_si128() is converted to the following optimal LLVM IR:

    ; Function Attrs: alwaysinline norecurse nounwind readonly uwtable
    define internal fastcc <2 x i64> @_mm_load_si128(<2 x i64>* nocapture readonly %__p) unnamed_addr #2 {
    entry:
      %0 = load <2 x i64>, <2 x i64>* %__p, align 16, !tbaa !2
      ret <2 x i64> %0
    }
The Function Integration/Inlining pass inlines this _mm_load_si128() body
into read_128b():

    %0 = load <2 x i64>, <2 x i64>* %ptr, align 16, !tbaa !2
However, owing to the 32-bit accesses in the subsequent OR expression, the
Combine Redundant Instructions pass (InstCombine) converts this load to:

    %1 = load i128, i128* %0, align 16, !tbaa !2

- which the X86 DAG->DAG Instruction Selection pass converts to:

    %1:gr64 = MOV64rm %0:gr64, 1, $noreg, 0, $noreg :: \
        (load 8 from %ir.0, align 16, !tbaa !2)
    %2:gr64 = MOV64rm %0:gr64, 1, $noreg, 8, $noreg :: \
        (load 8 from %ir.0 + 8, align 16, !tbaa !2)

- the problematic 64-bit loads.
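A C analogue of what the optimizer has effectively produced may make the
problem clearer (an illustrative sketch assuming __int128 support; this is
not code from the report):

    #include <stdint.h>

    /* The four 32-bit ORs have been merged into one 128-bit integer
     * load. An i128 cannot live in a general-purpose register, so the
     * backend splits the load into two 64-bit movq's. */
    uint32_t read_128b_scalar(void *ptr)
    {
        unsigned __int128 v = *(unsigned __int128 *) ptr;
        return (uint32_t) v         | (uint32_t) (v >> 32) |
               (uint32_t) (v >> 64) | (uint32_t) (v >> 96);
    }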
Per Ref. [3], Rationale: "Platforms may rely on volatile loads and stores
of natively supported data width to be executed as single instruction. For
example, in C this holds for an l-value of volatile primitive type with
native hardware support, but not necessarily for aggregate types. The
frontend upholds these expectations, which are intentionally unspecified in
the IR. The rules above ensure that IR transformations do not violate the
frontend's contract with the language."
Thus, the LLVM IR generated for loads and stores that operate on
'width'-wide vectors in a function carrying
__attribute__((min_vector_width(width))) should satisfy two properties:

  a. be at least 'width' wide;
  b. be marked 'volatile' (to prevent any subsequent pass from splitting
     them up).

Assuming that property (a) is correctly maintained by the front-end, the
problem reduces to ensuring that property (b) holds.
Hand-modifying the generated LLVM IR so that the load is volatile:

    Before:
    define internal <2 x i64> @_mm_load_si128(<2 x i64>* %__p) #2 {
    entry:
      %0 = load <2 x i64>, <2 x i64>* %__p, align 16
      ret <2 x i64> %0
    }

    After:
    define internal <2 x i64> @_mm_load_si128(<2 x i64>* %__p) #2 {
    entry:
      %0 = load volatile <2 x i64>, <2 x i64>* %__p, align 16
      ret <2 x i64> %0
    }
- we see that the 'load volatile <2 x i64>' does get converted to a 'movdqa',
as expected.
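The same experiment can be driven from the source level without editing the
IR, by loading through a volatile-qualified pointer (a sketch under the
assumption that clang emits a 'load volatile <2 x i64>' for such an access;
read_128b_v is a hypothetical name):

    #include <stdint.h>
    #include <emmintrin.h>

    uint32_t read_128b_v(__m128i *ptr)
    {
        __m128i val = *(volatile __m128i *) ptr;  /* volatile vector load */
        return ((uint32_t *) &val)[0] | ((uint32_t *) &val)[1] |
               ((uint32_t *) &val)[2] | ((uint32_t *) &val)[3];
    }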
PS: the same issue is also seen with __m256i and __m512i, and with
_mm_store_si128(), ...
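For the store side, the analogous pattern would look like the following
(a hypothetical repro sketch; whether it actually scalarizes depends on how
the optimizer merges the element stores):

    #include <stdint.h>
    #include <emmintrin.h>

    void write_128b(__m128i *ptr, uint32_t a, uint32_t b,
                    uint32_t c, uint32_t d)
    {
        __m128i val;
        ((uint32_t *) &val)[0] = a;
        ((uint32_t *) &val)[1] = b;
        ((uint32_t *) &val)[2] = c;
        ((uint32_t *) &val)[3] = d;
        _mm_store_si128(ptr, val);  /* expected: one 128-bit store */
    }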
I need your input on which of the following directions to take to fix this
issue:

  * Marking the loads and stores in these intrinsics as volatile during
    LLVM IR generation.
  * (Under an option) prohibiting the Combine Redundant Instructions pass
    from converting the vector load 'load <2 x i64>' into 'load i128'.
  * Making X86 DAG->DAG Instruction Selection generate a VMOVDQArm instead
    of two MOV64rm's on a 'load i128' for SSE.
References:
  0. Intel Intrinsics Guide (SSE2):
     https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE2
  1. "Difference between MOVDQA and MOVAPS x86 instructions?"
     https://stackoverflow.com/questions/6678073/difference-between-movdqa-and-movaps-x86-instructions
  2. Clang Attribute Reference, min_vector_width:
     https://clang.llvm.org/docs/AttributeReference.html#min-vector-width
  3. LLVM Language Reference, Volatile Memory Accesses:
     https://llvm.org/docs/LangRef.html#id1277