<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - [WebAssembly][SIMD] Codegen for trunc <16 x i32> to <16 x i8> can be improved"
   href="https://bugs.llvm.org/show_bug.cgi?id=51006">51006</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>[WebAssembly][SIMD] Codegen for trunc <16 x i32> to <16 x i8> can be improved
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: WebAssembly
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>jing.bao@intel.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>When I build a micro test case with -O3 for both X86 (-m32 -msse2 -msse3
-msse4.1 -msse4.2) and Wasm (-msimd128), the codegen for WebAssembly is
noticeably worse.

```
unsigned char buf[65536];
#pragma clang loop vectorize_width(16) interleave_count(1)
for (int i = 0; i < sizeof(buf); i++) {
    buf[i] = (char)(i * i);
}
```

The code above generates a trunc after loop vectorization:

```
%26 = trunc <16 x i32> %25 to <16 x i8>
```

During X86 instruction selection, this trunc is optimized to extract_subvector
and X86ISD::PACKUS (see the function combineVectorTruncation in
<a href="https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelLowering.cpp">https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelLowering.cpp</a>).

Wasm instruction selection currently has no similar optimization for trunc, so
the trunc is legalized into many extract_vector_elt and insert_vector_elt
nodes.
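
One possible direction, offered as a sketch and not as an existing LLVM
lowering: Wasm SIMD has saturating narrow instructions (i16x8.narrow_i32x4_u
and i8x16.narrow_i16x8_u) that play a role similar to PACKUS. If every i32 lane
is first masked with 0xFF, neither narrow can saturate, so the result should
equal a plain trunc. The scalar C model below checks that reasoning for one
group of 16 lanes. All function names in it are hypothetical.

```c
/* Scalar model only: each helper mimics ONE lane of a Wasm SIMD instruction. */

/* Models a lane of i16x8.narrow_i32x4_u: signed 32-bit in, clamp to [0, 65535]. */
static unsigned short narrow_i32_to_u16(int x) {
    if (x < 0) return 0;
    if (x > 65535) return 65535;
    return (unsigned short)x;
}

/* Models a lane of i8x16.narrow_i16x8_u: signed 16-bit in, clamp to [0, 255]. */
static unsigned char narrow_i16_to_u8(short x) {
    if (x < 0) return 0;
    if (x > 255) return 255;
    return (unsigned char)x;
}

/* Mask each lane to its low byte, then narrow twice. After the mask every
 * value is in [0, 255], so neither clamp changes anything. */
static void trunc16_via_narrow(const unsigned int in[16], unsigned char out[16]) {
    for (int i = 0; i < 16; i++) {
        int masked = (int)(in[i] & 0xFFu);
        unsigned short half = narrow_i32_to_u16(masked);
        out[i] = narrow_i16_to_u8((short)half);
    }
}

/* Reference behaviour of the trunc: keep the low 8 bits of each lane. */
static void trunc16_reference(const unsigned int in[16], unsigned char out[16]) {
    for (int i = 0; i < 16; i++) {
        out[i] = (unsigned char)in[i];
    }
}
```

If this holds, a vector lowering would presumably need four v128.and, two
i16x8.narrow_i32x4_u and one i8x16.narrow_i16x8_u for 16 lanes, instead of the
extract_lane/replace_lane chain. Lane order is preserved because each narrow
places the lanes of its first operand before those of its second.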

The final Wasm code looks like this:
```
          (i8x16.replace_lane 15
            (i8x16.replace_lane 14
              (i8x16.replace_lane 13
                (i8x16.replace_lane 12
                  (i8x16.replace_lane 11
                    (i8x16.replace_lane 10
                      (i8x16.replace_lane 9
                        (i8x16.replace_lane 8
                          (i8x16.replace_lane 7
                            (i8x16.replace_lane 6
                              (i8x16.replace_lane 5
                                (i8x16.replace_lane 4
                                  (i8x16.replace_lane 3
                                    (i8x16.replace_lane 2
                                      (i8x16.replace_lane 1
                                        (i8x16.splat
                                          (i32x4.extract_lane 0
                                            (local.tee 3
                                              (i32x4.mul
                                                (local.get 7)
                                                (local.get 7)))))
                                        (i32x4.extract_lane 1
                                          (local.get 3)))
                                      (i32x4.extract_lane 2
                                        (local.get 3)))
                                    (i32x4.extract_lane 3
                                      (local.get 3)))
                                  (i32x4.extract_lane 0
                                    (local.tee 3
                                      (i32x4.mul
                                        (local.get 6)
                                        (local.get 6)))))
                                (i32x4.extract_lane 1
                                  (local.get 3)))
                              (i32x4.extract_lane 2
                                (local.get 3)))
                            (i32x4.extract_lane 3
                              (local.get 3)))
                          (i32x4.extract_lane 0
                            (local.tee 3
                              (i32x4.mul
                                (local.get 5)
                                (local.get 5)))))
                        (i32x4.extract_lane 1
                          (local.get 3)))
                      (i32x4.extract_lane 2
                        (local.get 3)))
                    (i32x4.extract_lane 3
                      (local.get 3)))
                  (i32x4.extract_lane 0
                    (local.tee 3
                      (i32x4.mul
                        (local.get 4)
                        (local.get 4)))))
                (i32x4.extract_lane 1
                  (local.get 3)))
              (i32x4.extract_lane 2
                (local.get 3)))
            (i32x4.extract_lane 3
              (local.get 3))))
```
It seems this can be improved.</pre>
        </div>
      </p>


    </body>
</html>