<html>

    <head>

      <base href="http://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - Failure to simplify SIMD vector conversion."

   href="http://llvm.org/bugs/show_bug.cgi?id=16739">16739</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Failure to simplify SIMD vector conversion.

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>silvas@purdue.edu

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvmbugs@cs.uiuc.edu

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>This code was reduced from a function that converts between the SIMD vector

classes used by two different libraries. The source and destination vectors

both have <4 x float> underlying storage but notionally hold only {x, y, z}.

The destination duplicates z into the last lane; the source leaves it

undefined, I think.

typedef float __m128 __attribute__((__vector_size__(16)));

union ElementWiseAccess {

  ElementWiseAccess(__m128 v) : ReprM128(v) {}

  __m128 ReprM128;

  float ReprFloatArray[4];

  float getAt(int i) const { return ReprFloatArray[i]; }

};

// Making this return `const ElementWiseAccess` instead of

// `const ElementWiseAccess &` still results in a failure to optimize, but in a

// different way.

static const ElementWiseAccess &castToElementWiseAccess(const __m128 &t) {

  return reinterpret_cast<const ElementWiseAccess &>(t);

}

__m128 ConvertVectors(const __m128 &V) {

  // Replacing `castToElementWiseAccess` with directly calling

  // `ElementWiseAccess` makes the issue go away.

  return (__m128) { castToElementWiseAccess(V).getAt(0), //

                    castToElementWiseAccess(V).getAt(1), //

                    castToElementWiseAccess(V).getAt(2), //

                    castToElementWiseAccess(V).getAt(2) };

}

clang -O3 produces:

define <4 x float> @_Z14ConvertVectorsRKDv4_f(<4 x float>* nocapture readonly %V) #0 {

  %1 = bitcast <4 x float>* %V to [4 x float]*

  %2 = getelementptr inbounds <4 x float>* %V, i64 0, i64 0

  %3 = load float* %2, align 4, !tbaa !0

  %4 = insertelement <4 x float> undef, float %3, i32 0

  %5 = getelementptr inbounds [4 x float]* %1, i64 0, i64 1

  %6 = load float* %5, align 4, !tbaa !0

  %7 = insertelement <4 x float> %4, float %6, i32 1

  %8 = getelementptr inbounds [4 x float]* %1, i64 0, i64 2

  %9 = load float* %8, align 4, !tbaa !0

  %10 = insertelement <4 x float> %7, float %9, i32 2

  %11 = insertelement <4 x float> %10, float %9, i32 3

  ret <4 x float> %11

}

It appears that something is interfering with folding the load/insertelement

sequence into a vector load + shufflevector.
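
As a point of comparison, here is a minimal sketch of the semantics the desired
load + shuffle would implement, written with the GCC/Clang vector subscript
extension instead of the union cast. `Float4` and `ConvertVectorsSubscript` are
hypothetical names that are not part of the reduced test case, and I have not
checked what IR this form produces.

```cpp
// Hypothetical stand-in for the __m128 typedef in the test case above.
typedef float Float4 __attribute__((__vector_size__(16)));

static_assert(sizeof(Float4) == 16, "expected four packed floats");

// Reads lanes 0..2 via vector subscripting (a GCC/Clang extension) and
// duplicates lane 2 into lane 3, matching the shuffle mask <0, 1, 2, 2>.
Float4 ConvertVectorsSubscript(const Float4 &V) {
  Float4 Result = {V[0], V[1], V[2], V[2]};
  return Result;
}
```

The last input lane is never read, which matches the test case, where only
{x, y, z} are meaningful in the source vector.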

Applying the modification described in the comments, so that

`castToElementWiseAccess` returns by value instead of by reference, results in:

define <4 x float> @_Z14ConvertVectorsRKDv4_f(<4 x float>* nocapture readonly %V) #0 {

  %1 = bitcast <4 x float>* %V to i8*

  %2 = bitcast <4 x float>* %V to double*

  %3 = load double* %2, align 16

  %4 = getelementptr inbounds i8* %1, i64 8

  %5 = bitcast i8* %4 to double*

  %6 = bitcast double %3 to i64

  %trunc = trunc i64 %6 to i32

  %bitcast = bitcast i32 %trunc to float

  %7 = insertelement <4 x float> undef, float %bitcast, i32 0

  %8 = lshr i64 %6, 32

  %9 = trunc i64 %8 to i32

  %10 = bitcast i32 %9 to float

  %11 = insertelement <4 x float> %7, float %10, i32 1

  %12 = load double* %5, align 8

  %13 = bitcast double %12 to i64

  %trunc6 = trunc i64 %13 to i32

  %bitcast7 = bitcast i32 %trunc6 to float

  %14 = insertelement <4 x float> %11, float %bitcast7, i32 2

  %15 = insertelement <4 x float> %14, float %bitcast7, i32 3

  ret <4 x float> %15

}

The issue in this case seems to be that clang lowers `castToElementWiseAccess`

as returning `{double, double}`, which then prevents a <4 x float> load from

being generated.
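
To make the shape of this IR easier to follow, here is a minimal sketch of what
the `trunc`/`lshr` sequence computes for one 8-byte half: two floats are
recovered from a single 64-bit value. `FloatPair` and `splitEightBytes` are
hypothetical names, and the lane order assumes a little-endian target such as
x86-64.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the two floats stored in one 8-byte chunk.
struct FloatPair {
  float Lo; // float at the lower address (little-endian assumption)
  float Hi; // float at the higher address
};

// Loads 8 bytes as one integer, then splits it the way the IR above does.
inline FloatPair splitEightBytes(const void *Chunk) {
  std::uint64_t Bits;
  std::memcpy(&Bits, Chunk, sizeof(Bits));

  // Corresponds to `trunc i64 ... to i32`.
  std::uint32_t LoBits = static_cast<std::uint32_t>(Bits);
  // Corresponds to `lshr i64 ..., 32` followed by `trunc`.
  std::uint32_t HiBits = static_cast<std::uint32_t>(Bits >> 32);

  FloatPair P;
  std::memcpy(&P.Lo, &LoBits, sizeof(P.Lo)); // `bitcast i32 ... to float`
  std::memcpy(&P.Hi, &HiBits, sizeof(P.Hi));
  return P;
}
```

In the IR above, the first half yields lanes 0 and 1 this way, while only the
low float of the second half is used (for lanes 2 and 3).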

Replacing the call to `castToElementWiseAccess` with a direct invocation of

the constructor (e.g. `ElementWiseAccess(V).getAt(<<<n>>>)`) results in the

following code, which is the desired codegen for the initial test case:

define <4 x float> @_Z14ConvertVectorsRKDv4_f(<4 x float>* nocapture readonly %V) #0 {

  %1 = load <4 x float>* %V, align 16, !tbaa !0

  %2 = shufflevector <4 x float> %1, <4 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 2>

  ret <4 x float> %2

}</pre>

        </div>

      </p>


    </body>

</html>