[llvm-bugs] [Bug 42152] New: Vectorized code degrades performance 3x switching from SSE4.1 to SSE4.2

Thu Jun 6 02:02:43 PDT 2019

https://bugs.llvm.org/show_bug.cgi?id=42152

            Bug ID: 42152
           Summary: Vectorized code degrades performance 3x switching from
                    SSE4.1 to SSE4.2
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedbugs at nondot.org
          Reporter: spreis at yandex-team.ru
                CC: llvm-bugs at lists.llvm.org

For the code below (https://gcc.godbolt.org/z/Z3JgG6), compilation with SSE4.2
produces code that is 3x slower than with just SSE4.1.

The code for SSE4.1 is vectorized quite straightforwardly as i32x2 and unrolled
by 2, producing clean and understandable code.

For SSE4.2 the code is again vectorized as i32x2 and unrolled by 2, but some
optimization fuses the series of nice load <2 x i32> instructions scattered
through the loop (interleaved with compute) into a huge block of

  %64 = bitcast i32* %60 to <8 x i32>*
  %65 = bitcast i32* %63 to <8 x i32>*
  %66 = load <8 x i32>, <8 x i32>* %64, align 4, !dbg !46, !tbaa !48
  %67 = load <8 x i32>, <8 x i32>* %65, align 4, !dbg !46, !tbaa !48
  %68 = shufflevector <8 x i32> %66, <8 x i32> undef, <2 x i32> <i32 0, i32 4>, !dbg !46
  %69 = shufflevector <8 x i32> %67, <8 x i32> undef, <2 x i32> <i32 0, i32 4>, !dbg !46
  %70 = shufflevector <8 x i32> %66, <8 x i32> undef, <2 x i32> <i32 1, i32 5>, !dbg !46
  %71 = shufflevector <8 x i32> %67, <8 x i32> undef, <2 x i32> <i32 1, i32 5>, !dbg !46
  %72 = shufflevector <8 x i32> %66, <8 x i32> undef, <2 x i32> <i32 2, i32 6>, !dbg !46
  %73 = shufflevector <8 x i32> %67, <8 x i32> undef, <2 x i32> <i32 2, i32 6>, !dbg !46
  %74 = shufflevector <8 x i32> %66, <8 x i32> undef, <2 x i32> <i32 3, i32 7>, !dbg !46
  %75 = shufflevector <8 x i32> %67, <8 x i32> undef, <2 x i32> <i32 3, i32 7>, !dbg !46
  %76 = bitcast i32* %55 to <8 x i32>*
  %77 = bitcast i32* %58 to <8 x i32>*
  %78 = load <8 x i32>, <8 x i32>* %76, align 4, !dbg !52, !tbaa !48
  %79 = load <8 x i32>, <8 x i32>* %77, align 4, !dbg !52, !tbaa !48
  %80 = shufflevector <8 x i32> %78, <8 x i32> undef, <2 x i32> <i32 0, i32 4>, !dbg !52
  %81 = shufflevector <8 x i32> %79, <8 x i32> undef, <2 x i32> <i32 0, i32 4>, !dbg !52
  %82 = shufflevector <8 x i32> %78, <8 x i32> undef, <2 x i32> <i32 1, i32 5>, !dbg !52
  %83 = shufflevector <8 x i32> %79, <8 x i32> undef, <2 x i32> <i32 1, i32 5>, !dbg !52
  %84 = shufflevector <8 x i32> %78, <8 x i32> undef, <2 x i32> <i32 2, i32 6>, !dbg !52
  %85 = shufflevector <8 x i32> %79, <8 x i32> undef, <2 x i32> <i32 2, i32 6>, !dbg !52
  %86 = shufflevector <8 x i32> %78, <8 x i32> undef, <2 x i32> <i32 3, i32 7>, !dbg !52
  %87 = shufflevector <8 x i32> %79, <8 x i32> undef, <2 x i32> <i32 3, i32 7>, !dbg !52
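
To make the shuffle masks above concrete, here is a minimal scalar sketch of
what one such shufflevector computes. ExtractStride4 is a hypothetical helper
written only for illustration (it is not an LLVM API): a mask of
<lane, lane + 4> picks the same source element from two consecutive groups of
four, which is presumably how each <2 x i32> operand for one of the s0..s3
accumulators gets recovered from the wide load.

```cpp
#include <array>
#include <cstddef>

// Hypothetical scalar model of
//   shufflevector <8 x i32> %wide, <8 x i32> undef, <2 x i32> <lane, lane + 4>
// Result element j is wide[lane + 4 * j], i.e. a stride-4 extraction.
std::array<int, 2> ExtractStride4(const std::array<int, 8>& wide,
                                  std::size_t lane) {
    return {wide[lane], wide[lane + 4]};
}
```

With wide = {10, 11, 12, 13, 20, 21, 22, 23}, lane 0 yields {10, 20} and
lane 3 yields {13, 23}.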

Each shuffle is then lowered into 4 instructions:
        psllq   $32, %xmm6
        pshufd  $245, %xmm6, %xmm0      # xmm0 = xmm6[1,1,3,3]
        psrad   $31, %xmm6
        pblendw $51, %xmm0, %xmm6       # xmm6 = xmm0[0,1],xmm6[2,3],xmm0[4,5],xmm6[6,7]
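
For readers who want to trace the sequence, below is a scalar sketch that
models the four instructions on a register viewed as four 32-bit lanes. The
helper names (Psllq32, Pshufd245, Psrad31, Pblendw51, ModelSequence) are
hypothetical, and the model assumes the two i32 values sit in lanes 0 and 2 of
%xmm6 on entry, which the listing above does not show. Under that assumption
each 64-bit lane ends up holding its value sign-extended to 64 bits.

```cpp
#include <array>
#include <cstdint>

using Vec4 = std::array<std::uint32_t, 4>;  // one XMM register as 4 x 32-bit lanes

// psllq $32: shift each 64-bit lane left by 32, so the low dword moves up.
Vec4 Psllq32(const Vec4& v) { return {0u, v[0], 0u, v[2]}; }

// pshufd $245: select 32-bit lanes [1,1,3,3].
Vec4 Pshufd245(const Vec4& v) { return {v[1], v[1], v[3], v[3]}; }

// psrad $31: arithmetic right shift of each 32-bit lane, leaving only sign bits.
Vec4 Psrad31(const Vec4& v) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i) {
        r[i] = (v[i] & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    }
    return r;
}

// pblendw $51: 16-bit words 0,1,4,5 (32-bit lanes 0 and 2) come from src,
// the remaining lanes stay as they are in dst.
Vec4 Pblendw51(const Vec4& dst, const Vec4& src) {
    return {src[0], dst[1], src[2], dst[3]};
}

// Runs the four modeled instructions in the order shown in the listing and
// returns the two resulting 64-bit lanes as signed integers.
std::array<std::int64_t, 2> ModelSequence(Vec4 xmm6) {
    xmm6 = Psllq32(xmm6);
    Vec4 xmm0 = Pshufd245(xmm6);
    xmm6 = Psrad31(xmm6);
    xmm6 = Pblendw51(xmm6, xmm0);
    auto lane = [&xmm6](int i) {
        std::uint64_t bits =
            (static_cast<std::uint64_t>(xmm6[2 * i + 1]) << 32) | xmm6[2 * i];
        return static_cast<std::int64_t>(bits);
    };
    return {lane(0), lane(1)};
}
```

Feeding lanes {-5, 123, 7, 456} through ModelSequence gives the 64-bit values
-5 and 7; the contents of lanes 1 and 3 on entry are discarded.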

These double the number of instructions in the loop and significantly increase
register pressure. It seems that something is wrong with the cost model for
this optimization. I can hardly believe that such a transformation can ever be
profitable with SSE4 at all: it provides a 2x improvement on loads, but the
shuffles seem to be too costly.

---

The code:

    template <typename T, typename R = T>
    R AbsDiff(T a, T b) {
        if (a < b)
            return (R)b - (R)a;
        return (R)a - (R)b;
    }

    template <typename Number, typename Result = unsigned long long>
    Result L1DistanceImpl(const Number* lhs, const Number* rhs, int length) {
        Result s0 = 0;
        Result s1 = 0;
        Result s2 = 0;
        Result s3 = 0;

        while (length >= 4) {
            s0 += AbsDiff(lhs[0], rhs[0]);
            s1 += AbsDiff(lhs[1], rhs[1]);
            s2 += AbsDiff(lhs[2], rhs[2]);
            s3 += AbsDiff(lhs[3], rhs[3]);
            length -= 4;
            lhs += 4;
            rhs += 4;
        }

        while (length) {
            s0 += AbsDiff(*lhs++, *rhs++);
            --length;
        }

        return s0 + s1 + s2 + s3;
    }

    unsigned long long L1Distance(const int* lhs, const int* rhs, int length) {
        return L1DistanceImpl<int>(lhs, rhs, length);
    }
