[llvm-bugs] [Bug 24965] New: Missed folding of a constant expression causes very poor codegen.

via llvm-bugs llvm-bugs at lists.llvm.org
Mon Sep 28 05:43:36 PDT 2015


https://llvm.org/bugs/show_bug.cgi?id=24965

            Bug ID: 24965
           Summary: Missed folding of a constant expression causes very
                    poor codegen.
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P
         Component: Scalar Optimizations
          Assignee: unassignedbugs at nondot.org
          Reporter: andrea.dibiagio at gmail.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

Created attachment 14947
  --> https://llvm.org/bugs/attachment.cgi?id=14947&action=edit
Reproducible

Problem seen in the following code:

////////  test.cpp  //////////
#include <x86intrin.h>

inline void il128_(__m128i &rc, char _0, char _1, char _2, char _3,
                   char _4, char _5, char _6, char _7, char _8, char _9,
                   char _A, char _B, char _C, char _D, char _E, char _F) {
  rc = _mm_set_epi8(_F, _E, _D, _C, _B, _A, _9, _8,
                    _7, _6, _5, _4, _3, _2, _1, _0);
}

__m128i B_shufb_(__m128i ra, __m128i rb)
{
  __m128i rc;
  il128_(rc, 0x19, 0x9, 0x9, 0x19, 0x80, 0x80, 0x80, 0x80,
         0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
  const __m128i SelMaskP =
      _mm_and_si128(_mm_srli_epi16(rc, 5), _mm_set1_epi16(0x0707));
  __m128i TMP = _mm_and_si128(rc, _mm_set1_epi32(0xf0f0f0f0));
  const __m128i SelMaskA = _mm_cmpeq_epi8(_mm_setzero_si128(), TMP);
  const __m128i SelMaskB = _mm_cmpeq_epi8(_mm_set1_epi16(0x1010), TMP);
  const __m128i FromA = _mm_and_si128(SelMaskA, _mm_shuffle_epi8(ra, rc));
  const __m128i FromB = _mm_and_si128(SelMaskB, _mm_shuffle_epi8(rb, rc));
  const __m128i FromPattern = _mm_shuffle_epi8(
      _mm_set_epi8(0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                   0x80, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00),
      SelMaskP);
  return (FromA | FromB) | FromPattern;
}
////////


> clang test.cpp -O2 -march=corei7 -S -o -

The compiler generates an extremely long sequence of 16 basic blocks. The
entire sequence appears to be caused by a missed fold of a complex constant
expression involving bitcasts, AND, and ICMP.

I have also attached an IR reproducer named repro.ll.

> opt repro.ll -inline-cost -inline -sroa -early-cse -S -o -

You should be able to see an (extremely long) constant expression produced by
early-cse.

===========================

Analysis of the problem:
------------------------------
In the reproducer, vector 'rc' is passed by reference to function il128_.
Therefore the compiler generates an alloca to reserve a stack slot for
'rc'.

After inlining, SROA promotes 'rc' to a register via 'mem2reg'. In our
reproducer, the optimizer (after inlining) is able to fold the inlined call to
il128_ to the constant vector:

<16 x i8> <i8 25, i8 9, i8 9, i8 25, i8 -128, i8 -128, i8 -128, i8 -128,
i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>

For simplicity, let's call the above constant vector %CV.

Before SROA is run, the LLVM IR at the beginning of function 'B_shufb_' looks
like this:

   %rc = alloca <2 x i64>, align 16
   %0 = bitcast <2 x i64>* %rc to <16 x i8>*
   store <16 x i8> %CV, <16 x i8>* %0, align 16
   ...
   %1 = load <2 x i64>, <2 x i64>* %rc, align 16
   ...

SROA identifies one alloca slice (slice [0,16) ) related to alloca instruction:
 %rc = alloca <2 x i64>, align 16

That slice is used by a vector store:
   store <16 x i8> %CV, <16 x i8>* %0, align 16

..and a vector load:
   %1 = load <2 x i64>, <2 x i64>* %rc, align 16


Here is the important part:
SROA replaces *ALL* the occurrences of %CV with the following constant
expression: bitcast (<16 x i8> %CV to <2 x i64>).

So, the following store:
  store <16 x i8> %CV, <16 x i8>* %0, align 16

becomes:
  store <2 x i64> bitcast (<16 x i8> %CV to <2 x i64>), <2 x i64>* %rc, align 16

Note that this is perfectly okay. An advantage is that the same vector type is
now used by the load and store.

It is worth noting, however, that SROA does not check whether the stored value
is a constant. So it does not immediately simplify the bitcast of %CV; instead
it propagates a new constant expression to all the users of %CV.

Note that in theory this behavior is perfectly valid; other passes (for
example, Instruction Simplify) are responsible for folding all the trivial
constant expressions.
Unfortunately, this does not always happen in practice: Instruction Simplify
does not know how to fold complicated constant expressions involving a mix of
binary operators, compares, and bitcasts.

Before SROA is run we have code like this:

  %1 = load <2 x i64>, <2 x i64>* %rc, align 16
  ...
  %and.i.44 = and <2 x i64> %1,
                  <i64 -1085102592571150096, i64 -1085102592571150096>
  %5 = bitcast <2 x i64> %and.i.44 to <16 x i8>
  %cmp.i.39 = icmp eq <16 x i8> %5, zeroinitializer


After SROA, the code above is rewritten to this:

  %and.i.44 = and <2 x i64> bitcast (<16 x i8> %CV to <2 x i64>),
                  <i64 -1085102592571150096, i64 -1085102592571150096>
  %3 = bitcast <2 x i64> %and.i.44 to <16 x i8>
  %cmp.i.39 = icmp eq <16 x i8> %3, zeroinitializer

At this point the damage is done, because Instruction Simplify is not smart
enough to fold %and.i.44 to a constant. So EarlyCSE decides to scalarize that
awkward constant expression, obtaining the following (even more problematic)
constant expression:

<16 x i8> bitcast (<2 x i64> <i64 and (i64 extractelement (<2 x i64> bitcast
(<16 x i8> <i8 8, i8 9, i8 24, i8 25, i8 -128, i8 -128, i8 -128, i8 -128, i8
-128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128> to <2 x
i64>), i32 0), i64 -1085102592571150096), i64 and (i64 extractelement (<2 x
i64> bitcast (<16 x i8> <i8 8, i8 9, i8 24, i8 25, i8 -128, i8 -128, i8 -128,
i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8
-128> to <2 x i64>), i32 1), i64 -1085102592571150096)> to <16 x i8>)

The major problem is that the new expression is still a ConstantExpr! So, when
EarlyCSE replaces the old expression with the new one, it also propagates the
new value to all the users.

Unfortunately, one of the users is the following vector integer compare
instruction:

  %cmp.i.39 = icmp eq <16 x i8> %3, zeroinitializer

After constant propagation, %cmp.i.39 becomes:

%cmp.i.39 = icmp eq <16 x i8> bitcast (<2 x i64> <i64 and (i64 extractelement
(<2 x i64> bitcast (<16 x i8> <i8 8, i8 9, i8 24, i8 25, i8 -128, i8 -128, i8
-128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128,
i8 -128> to <2 x i64>), i32 0), i64 -1085102592571150096), i64 and (i64
extractelement (<2 x i64> bitcast (<16 x i8> <i8 8, i8 9, i8 24, i8 25, i8
-128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128,
i8 -128, i8 -128, i8 -128> to <2 x i64>), i32 1), i64 -1085102592571150096)> to
<16 x i8>), zeroinitializer

EarlyCSE tries (and fails) again to simplify the instruction above. Eventually
it decides to scalarize the vector compare, ending up with an extremely long,
sub-optimal constant expression which is *16 times* longer than the one above!

That extremely long constant expression is never folded away and eventually
reaches the backend, where it gets expanded into a sequence of 16 basic blocks
(one basic block per scalar compare).

I hope this description helps!
