<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [WebAssembly][SIMD] Codegen for trunc <16 x i32> to <16 x i8> can be improved"

   href="https://bugs.llvm.org/show_bug.cgi?id=51006">51006</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[WebAssembly][SIMD] Codegen for trunc <16 x i32> to <16 x i8> can be improved

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: WebAssembly

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>jing.bao@intel.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>When I build a micro case with -O3 for both X86 (-m32 -msse2 -msse3 -msse4.1

-msse4.2) and Wasm(-msimd128), I found that the codegen for WebAssembly is not

that good.

```

unsigned char buf[65536];

#pragma clang loop vectorize_width(16) interleave_count(1)

for (int i = 0; i < sizeof(buf); i++) {

    buf[i] = (char)(i * i);

}

```

The above code generates a trunc after loop vectorization:

```

%26 = trunc <16 x i32> %25 to <16 x i8>

```

For X86 instruction selection, this trunc is optimized to extract_subvector and

X86ISD::PACKUS (see the function combineVectorTruncation in

<a href="https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelLowering.cpp">https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelLowering.cpp</a>).

For Wasm instruction selection, however, there is currently no similar optimization

for trunc, so it is legalized to a long chain of extract_vector_elt and

insert_vector_elt nodes.
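
One possible direction, stated here as an assumption rather than a confirmed
plan: Wasm SIMD has saturating narrowing instructions (i16x8.narrow_i32x4_u and
i8x16.narrow_i16x8_u), so masking each i32 lane with 0xFF and then narrowing
twice should produce the same bytes as the trunc. The scalar C model below only
checks that per-lane equivalence. The helper names are hypothetical and this is
not LLVM code.

```c
/* Scalar model of one lane of i16x8.narrow_i32x4_u:
   the input is treated as signed and saturated to [0, 65535]. */
static unsigned short narrow_i32_to_u16(int x) {
    if (x < 0) return 0;
    if (x > 65535) return 65535;
    return (unsigned short)x;
}

/* Scalar model of one lane of i8x16.narrow_i16x8_u:
   the input is treated as signed 16-bit and saturated to [0, 255]. */
static unsigned char narrow_i16_to_u8(short x) {
    if (x < 0) return 0;
    if (x > 255) return 255;
    return (unsigned char)x;
}

/* Hypothetical helper: truncate 16 i32 lanes to 16 i8 lanes.
   Masking first keeps every lane in [0, 255], so neither
   saturating narrow can clamp and the low byte survives intact. */
static void trunc_v16i32_to_v16i8(const int in[16], unsigned char out[16]) {
    for (int lane = 0; lane < 16; lane++) {
        int masked = in[lane] & 0xFF;
        unsigned short half = narrow_i32_to_u16(masked);
        out[lane] = narrow_i16_to_u8((short)half);
    }
}
```

In vector form this would be four v128.and, two i16x8.narrow_i32x4_u and one
i8x16.narrow_i16x8_u per 16 output bytes, in place of the per-lane extract_lane
and replace_lane sequence.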

The resulting Wasm code looks like this:

```

          (i8x16.replace_lane 15

            (i8x16.replace_lane 14

              (i8x16.replace_lane 13

                (i8x16.replace_lane 12

                  (i8x16.replace_lane 11

                    (i8x16.replace_lane 10

                      (i8x16.replace_lane 9

                        (i8x16.replace_lane 8

                          (i8x16.replace_lane 7

                            (i8x16.replace_lane 6

                              (i8x16.replace_lane 5

                                (i8x16.replace_lane 4

                                  (i8x16.replace_lane 3

                                    (i8x16.replace_lane 2

                                      (i8x16.replace_lane 1

                                        (i8x16.splat

                                          (i32x4.extract_lane 0

                                            (local.tee 3

                                              (i32x4.mul

                                                (local.get 7)

                                                (local.get 7)))))

                                        (i32x4.extract_lane 1

                                          (local.get 3)))

                                      (i32x4.extract_lane 2

                                        (local.get 3)))

                                    (i32x4.extract_lane 3

                                      (local.get 3)))

                                  (i32x4.extract_lane 0

                                    (local.tee 3

                                      (i32x4.mul

                                        (local.get 6)

                                        (local.get 6)))))

                                (i32x4.extract_lane 1

                                  (local.get 3)))

                              (i32x4.extract_lane 2

                                (local.get 3)))

                            (i32x4.extract_lane 3

                              (local.get 3)))

                          (i32x4.extract_lane 0

                            (local.tee 3

                              (i32x4.mul

                                (local.get 5)

                                (local.get 5)))))

                        (i32x4.extract_lane 1

                          (local.get 3)))

                      (i32x4.extract_lane 2

                        (local.get 3)))

                    (i32x4.extract_lane 3

                      (local.get 3)))

                  (i32x4.extract_lane 0

                    (local.tee 3

                      (i32x4.mul

                        (local.get 4)

                        (local.get 4)))))

                (i32x4.extract_lane 1

                  (local.get 3)))

              (i32x4.extract_lane 2

                (local.get 3)))

            (i32x4.extract_lane 3

              (local.get 3))))

```

It seems this can be improved.</pre>

        </div>

      </p>


    </body>

</html>