[llvm-bugs] [Bug 42152] New: Vectorized code degrades performance 3x switching from SSE4.1 to SSE4.2
via llvm-bugs
llvm-bugs at lists.llvm.org
Thu Jun 6 02:02:43 PDT 2019
https://bugs.llvm.org/show_bug.cgi?id=42152
Bug ID: 42152
Summary: Vectorized code degrades performance 3x switching from
SSE4.1 to SSE4.2
Product: libraries
Version: trunk
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Loop Optimizer
Assignee: unassignedbugs at nondot.org
Reporter: spreis at yandex-team.ru
CC: llvm-bugs at lists.llvm.org
For the code below (https://gcc.godbolt.org/z/Z3JgG6), compiling with SSE4.2
produces code that is 3x slower than with just SSE4.1.
The SSE4.1 code is vectorized straightforwardly as i32x2 and unrolled by 2,
producing clean and understandable code.
For SSE4.2 the code is again vectorized as i32x2 and unrolled by 2, but some
optimization fuses the series of simple load <2 x i32> instructions scattered
through the loop (interleaved with compute) into one huge block of
%64 = bitcast i32* %60 to <8 x i32>*
%65 = bitcast i32* %63 to <8 x i32>*
%66 = load <8 x i32>, <8 x i32>* %64, align 4, !dbg !46, !tbaa !48
%67 = load <8 x i32>, <8 x i32>* %65, align 4, !dbg !46, !tbaa !48
%68 = shufflevector <8 x i32> %66, <8 x i32> undef, <2 x i32> <i32 0, i32 4>, !dbg !46
%69 = shufflevector <8 x i32> %67, <8 x i32> undef, <2 x i32> <i32 0, i32 4>, !dbg !46
%70 = shufflevector <8 x i32> %66, <8 x i32> undef, <2 x i32> <i32 1, i32 5>, !dbg !46
%71 = shufflevector <8 x i32> %67, <8 x i32> undef, <2 x i32> <i32 1, i32 5>, !dbg !46
%72 = shufflevector <8 x i32> %66, <8 x i32> undef, <2 x i32> <i32 2, i32 6>, !dbg !46
%73 = shufflevector <8 x i32> %67, <8 x i32> undef, <2 x i32> <i32 2, i32 6>, !dbg !46
%74 = shufflevector <8 x i32> %66, <8 x i32> undef, <2 x i32> <i32 3, i32 7>, !dbg !46
%75 = shufflevector <8 x i32> %67, <8 x i32> undef, <2 x i32> <i32 3, i32 7>, !dbg !46
%76 = bitcast i32* %55 to <8 x i32>*
%77 = bitcast i32* %58 to <8 x i32>*
%78 = load <8 x i32>, <8 x i32>* %76, align 4, !dbg !52, !tbaa !48
%79 = load <8 x i32>, <8 x i32>* %77, align 4, !dbg !52, !tbaa !48
%80 = shufflevector <8 x i32> %78, <8 x i32> undef, <2 x i32> <i32 0, i32 4>, !dbg !52
%81 = shufflevector <8 x i32> %79, <8 x i32> undef, <2 x i32> <i32 0, i32 4>, !dbg !52
%82 = shufflevector <8 x i32> %78, <8 x i32> undef, <2 x i32> <i32 1, i32 5>, !dbg !52
%83 = shufflevector <8 x i32> %79, <8 x i32> undef, <2 x i32> <i32 1, i32 5>, !dbg !52
%84 = shufflevector <8 x i32> %78, <8 x i32> undef, <2 x i32> <i32 2, i32 6>, !dbg !52
%85 = shufflevector <8 x i32> %79, <8 x i32> undef, <2 x i32> <i32 2, i32 6>, !dbg !52
%86 = shufflevector <8 x i32> %78, <8 x i32> undef, <2 x i32> <i32 3, i32 7>, !dbg !52
%87 = shufflevector <8 x i32> %79, <8 x i32> undef, <2 x i32> <i32 3, i32 7>, !dbg !52
Each shuffle is then lowered into 4 instructions:
psllq $32, %xmm6
pshufd $245, %xmm6, %xmm0 # xmm0 = xmm6[1,1,3,3]
psrad $31, %xmm6
pblendw $51, %xmm0, %xmm6 # xmm6 = xmm0[0,1],xmm6[2,3],xmm0[4,5],xmm6[6,7]
Those double the number of instructions in the loop and significantly increase
register pressure. Something seems to be wrong with the cost model for this
optimization. I can hardly believe that such a transformation can ever be
profitable with SSE4 at all: it gives a 2x improvement on loads, but the
shuffles seem to be too costly.
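
To make the IR above easier to read: each shufflevector mask <k, k + 4> picks
two elements that sit four apart in one 8-element load, so the four shuffles
together deinterleave a stride-4 access. Below is a minimal C++ model of that
element selection only. The name Deinterleave is hypothetical and is not taken
from LLVM or from the reported code.

```cpp
#include <array>

// Hypothetical model of one <8 x i32> load followed by four shufflevector
// operations with masks <0,4>, <1,5>, <2,6>, <3,7>.
// Result k holds elements {p[k], p[k + 4]} of the 8-element load.
std::array<std::array<int, 2>, 4> Deinterleave(const int* p) {
    std::array<std::array<int, 2>, 4> lanes{};
    for (int k = 0; k < 4; ++k) {
        lanes[k] = {p[k], p[k + 4]};  // mask <k, k + 4>
    }
    return lanes;
}
```

Calling Deinterleave on the eight values 10, 11, 12, 13, 20, 21, 22, 23 pairs
10 with 20 for mask <0, 4> and 13 with 23 for mask <3, 7>.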
---
The code:
template <typename T, typename R = T>
R AbsDiff(T a, T b) {
    if (a < b)
        return (R)b - (R)a;
    return (R)a - (R)b;
}

template <typename Number, typename Result = unsigned long long>
Result L1DistanceImpl(const Number* lhs, const Number* rhs, int length) {
    Result s0 = 0;
    Result s1 = 0;
    Result s2 = 0;
    Result s3 = 0;
    while (length >= 4) {
        s0 += AbsDiff(lhs[0], rhs[0]);
        s1 += AbsDiff(lhs[1], rhs[1]);
        s2 += AbsDiff(lhs[2], rhs[2]);
        s3 += AbsDiff(lhs[3], rhs[3]);
        length -= 4;
        lhs += 4;
        rhs += 4;
    }
    while (length) {
        s0 += AbsDiff(*lhs++, *rhs++);
        --length;
    }
    return s0 + s1 + s2 + s3;
}

unsigned long long L1Distance(const int* lhs, const int* rhs, int length) {
    return L1DistanceImpl<int>(lhs, rhs, length);
}
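
When comparing the SSE4.1 and SSE4.2 builds it may help to check both against
a plain, non-unrolled loop. The sketch below is one possible reference. The
name L1DistanceReference is hypothetical and not part of the report. It widens
to 64 bits before subtracting, so it should agree with L1Distance whenever the
int subtraction in AbsDiff does not overflow.

```cpp
// Hypothetical reference implementation: sum of |lhs[i] - rhs[i]|.
// Assumption: operands are widened to long long before subtracting, so the
// result matches the reported code only when its int subtraction does not
// overflow.
unsigned long long L1DistanceReference(const int* lhs, const int* rhs, int length) {
    unsigned long long sum = 0;
    for (int i = 0; i < length; ++i) {
        const long long a = lhs[i];
        const long long b = rhs[i];
        sum += static_cast<unsigned long long>(a < b ? b - a : a - b);
    }
    return sum;
}
```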