<html>

    <head>

      <base href="https://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - IndVarSimplify + InstCombine integer widening does not play nice with loop vectorizer"

   href="https://llvm.org/bugs/show_bug.cgi?id=28070">28070</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>IndVarSimplify + InstCombine integer widening does not play nice with loop vectorizer

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Scalar Optimizations

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>mkuper@google.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>davidxl@google.com, llvm-bugs@lists.llvm.org, wmi@google.com

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Consider a reduction loop that accumulates products of i32 values into an i64:

long long foo() {

  long long x = 42;

#pragma nounroll

#pragma clang loop interleave_count(1)

  for (int i = 0; i < 1000; i++) {

    x += i * i;

  }

  return x;

}

When compiling with:

$ clang -c -S -o - -O3 -mavx2 --target=x86_64

We'd like to get:

.LBB0_1:

    vpmulld    %xmm2, %xmm2, %xmm3

    vpaddd    %xmm1, %xmm2, %xmm2

    vpmovzxdq    %xmm3, %ymm3

    vpaddq    %ymm0, %ymm3, %ymm0

    addq    $-4, %rax

    jne    .LBB0_1
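
In scalar terms, the loop above does a 32-bit multiply per lane and only widens
the product before accumulating. A minimal C sketch of that shape, emulating
the 4 lanes with arrays. The name foo_desired_shape, the placement of the
initial 42 in lane 0, and the countdown from 1000 are assumptions made for
illustration, not taken from compiler output:

```c
/* Scalar emulation of the desired 4-lane loop body (illustrative only). */
long long foo_desired_shape(void) {
    unsigned int iv[4] = {0, 1, 2, 3};   /* vector induction variable, 4 x i32 */
    long long acc[4] = {42, 0, 0, 0};    /* 4 x i64 partial sums (assumed layout) */

    for (int n = 1000; n != 0; n -= 4) { /* addq $-4, %rax ; jne */
        for (int lane = 0; lane < 4; lane++) {
            unsigned int prod = iv[lane] * iv[lane]; /* vpmulld: multiply in 32 bits */
            acc[lane] += (long long)prod;            /* vpmovzxdq, then vpaddq */
            iv[lane] += 4;                           /* vpaddd: step every lane by 4 */
        }
    }
    /* Horizontal reduction of the partial sums, done once after the loop. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

This computes the same sum as foo(); the point is that only the accumulate is
done at 64 bits.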

What we actually get is:

.LBB0_1:

    vpsrlq    $32, %ymm3, %ymm4

    vpmuludq    %ymm4, %ymm3, %ymm4

    vpmuludq    %ymm3, %ymm3, %ymm5

    vpaddq    %ymm1, %ymm3, %ymm3

    vpsllq    $32, %ymm4, %ymm4

    vpaddq    %ymm4, %ymm5, %ymm5

    vpaddq    %ymm4, %ymm5, %ymm4

    vpblendd    $170, %ymm2, %ymm4, %ymm4

    vpaddq    %ymm0, %ymm4, %ymm0

    addq    $-4, %rax

    jne    .LBB0_1

IndVarSimplify first promotes the induction variable from i32 to i64:

for.body:

  %i.09 = phi i32 [ 0, %entry ], [ %inc, %for.body ]                   <==

  %x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]

  %mul = mul nsw i32 %i.09, %i.09 

  %conv7 = zext i32 %mul to i64

  %add = add nsw i64 %conv7, %x.08

  %inc = add nsw i32 %i.09, 1                                          <==

  %cmp = icmp slt i32 %inc, 1000

  br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !1


becomes

for.body:                                         ; preds = %entry, %for.body

  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ] <==

  %x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]

  %0 = trunc i64 %indvars.iv to i32                                    <== 

  %mul = mul nsw i32 %0, %0

  %conv7 = zext i32 %mul to i64

  %add = add nsw i64 %conv7, %x.08

  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1                    <==

  %exitcond = icmp ne i64 %indvars.iv.next, 1000

  br i1 %exitcond, label %for.body, label %for.cond.cleanup, !llvm.loop !1

InstCombine then notices the trunc -> mul -> zext chain and promotes the whole

computation to i64, turning the zext into a mask of the low 32 bits:

for.body:                                         ; preds = %for.body, %entry

  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]

  %x.08 = phi i64 [ 42, %entry ], [ %add, %for.body ]

  %mul = mul i64 %indvars.iv, %indvars.iv                              <==

  %conv7 = and i64 %mul, 4294967295                                    <==

  %add = add nsw i64 %conv7, %x.08

  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1

  %exitcond = icmp eq i64 %indvars.iv.next, 1000

  br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !1
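
The rewrite is value-preserving, which is why InstCombine is free to do it: the
low 32 bits of a 64-bit product depend only on the low 32 bits of the operands.
A small C sketch of the two forms, assuming a 32-bit unsigned int and a 64-bit
unsigned long long, and using unsigned arithmetic to sidestep signed-overflow
undefined behaviour in C. The function names are made up for illustration:

```c
/* Before InstCombine: truncate, multiply in 32 bits, zero-extend. */
unsigned long long narrow_form(unsigned long long iv) {
    unsigned int t = (unsigned int)iv;   /* trunc i64 -> i32 */
    unsigned int m = t * t;              /* mul i32, wraps modulo 2^32 */
    return (unsigned long long)m;        /* zext i32 -> i64 */
}

/* After InstCombine: multiply in 64 bits, then keep the low 32 bits. */
unsigned long long wide_form(unsigned long long iv) {
    unsigned long long m = iv * iv;      /* mul i64 */
    return m & 4294967295ULL;            /* and i64 %mul, 4294967295 */
}
```

Both functions agree for every input, so the transform is correct; the problem
is only that the wide form hides the fact that a 32-bit multiply is enough.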

This is unfortunate, because we end up with i64 vector muls, which are illegal

for the target, and really messy codegen once they are expanded. Note that the

issue is not AVX2-specific; that's just the cleanest example. We can get

similar nonsense with other feature sets.
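
To see where the extra instructions come from: AVX2 has no packed 64-bit
multiply, so each i64 lane is squared using 32x32->64 multiplies (vpmuludq)
plus shifts and adds. A C sketch of that expansion for a single lane. The
function name is hypothetical, and the comments map each step to the
instructions in the loop above only as far as the listing allows:

```c
/* Square one 64-bit lane using only 32x32->64 multiplies (illustrative). */
unsigned long long square_via_32bit_parts(unsigned long long x) {
    unsigned long long lo = x & 4294967295ULL; /* low half; vpmuludq reads only this */
    unsigned long long hi = x >> 32;           /* vpsrlq $32 */
    unsigned long long lo_lo = lo * lo;        /* vpmuludq x, x */
    unsigned long long cross = hi * lo;        /* vpmuludq hi, x */
    cross = cross << 32;                       /* vpsllq $32 */
    /* x*x mod 2^64 = lo*lo + 2 * ((hi*lo) << 32); the hi*hi term shifts out. */
    return lo_lo + cross + cross;              /* the two vpaddq */
}
```

The trailing vpblendd presumably implements the "and" with 4294967295 by
zeroing the upper half of each lane, assuming %ymm2 holds zero.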

Undoing this in codegen (by matching the "mul + and" back into a "trunc + mul +

zext") doesn't seem sufficient, since ideally we'd also like the vectorizer to

know what the real width is going to be. What we really want is to vectorize

this reduction by a factor of 8, like GCC does, and not by 4, and that would

require the cost model to know that we're reducing i32 values into an i64

result.</pre>

        </div>

      </p>


    </body>

</html>