[llvm-bugs] [Bug 51006] New: [WebAssembly][SIMD] Codegen for trunc <16 x i32> to <16 x i8> can be improved

via llvm-bugs llvm-bugs at lists.llvm.org
Wed Jul 7 02:07:04 PDT 2021


https://bugs.llvm.org/show_bug.cgi?id=51006

            Bug ID: 51006
           Summary: [WebAssembly][SIMD] Codegen for trunc <16 x i32> to
                    <16 x i8> can be improved
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: WebAssembly
          Assignee: unassignedbugs at nondot.org
          Reporter: jing.bao at intel.com
                CC: llvm-bugs at lists.llvm.org

When I build a small test case with -O3 for both X86 (-m32 -msse2 -msse3 -msse4.1
-msse4.2) and Wasm (-msimd128), the code generated for WebAssembly is noticeably
worse than the X86 output.

```
unsigned char buf[65536];
#pragma clang loop vectorize_width(16) interleave_count(1)
for (int i = 0; i < sizeof(buf); i++) {
    buf[i] = (char)(i * i);
}
```

The code above produces a trunc after loop vectorization:

```
%26 = trunc <16 x i32> %25 to <16 x i8>
```

During X86 instruction selection, this trunc is lowered to extract_subvector and
X86ISD::PACKUS (see the function combineVectorTruncation in
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelLowering.cpp).

Wasm instruction selection currently has no similar optimization for trunc, so
the trunc is legalized into a long chain of extract_vector_elt and
insert_vector_elt nodes.

The final Wasm code looks like this:
```
          (i8x16.replace_lane 15
            (i8x16.replace_lane 14
              (i8x16.replace_lane 13
                (i8x16.replace_lane 12
                  (i8x16.replace_lane 11
                    (i8x16.replace_lane 10
                      (i8x16.replace_lane 9
                        (i8x16.replace_lane 8
                          (i8x16.replace_lane 7
                            (i8x16.replace_lane 6
                              (i8x16.replace_lane 5
                                (i8x16.replace_lane 4
                                  (i8x16.replace_lane 3
                                    (i8x16.replace_lane 2
                                      (i8x16.replace_lane 1
                                        (i8x16.splat
                                          (i32x4.extract_lane 0
                                            (local.tee 3
                                              (i32x4.mul
                                                (local.get 7)
                                                (local.get 7)))))
                                        (i32x4.extract_lane 1
                                          (local.get 3)))
                                      (i32x4.extract_lane 2
                                        (local.get 3)))
                                    (i32x4.extract_lane 3
                                      (local.get 3)))
                                  (i32x4.extract_lane 0
                                    (local.tee 3
                                      (i32x4.mul
                                        (local.get 6)
                                        (local.get 6)))))
                                (i32x4.extract_lane 1
                                  (local.get 3)))
                              (i32x4.extract_lane 2
                                (local.get 3)))
                            (i32x4.extract_lane 3
                              (local.get 3)))
                          (i32x4.extract_lane 0
                            (local.tee 3
                              (i32x4.mul
                                (local.get 5)
                                (local.get 5)))))
                        (i32x4.extract_lane 1
                          (local.get 3)))
                      (i32x4.extract_lane 2
                        (local.get 3)))
                    (i32x4.extract_lane 3
                      (local.get 3)))
                  (i32x4.extract_lane 0
                    (local.tee 3
                      (i32x4.mul
                        (local.get 4)
                        (local.get 4)))))
                (i32x4.extract_lane 1
                  (local.get 3)))
              (i32x4.extract_lane 2
                (local.get 3)))
            (i32x4.extract_lane 3
              (local.get 3))))
```
This seems like it can be improved.
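
One possible direction, sketched here as a scalar C model rather than as a
tested patch: Wasm SIMD has saturating narrow instructions
(i16x8.narrow_i32x4_u and i8x16.narrow_i16x8_u) that play a role similar to
the pack instructions on X86. Because they saturate instead of truncating,
each lane would first have to be masked with 0xFF. The function names below
are hypothetical and only model the per-lane semantics; they are not LLVM or
wasm_simd128.h APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Models one lane of i16x8.narrow_i32x4_u: the input is read as a signed
   32-bit value and saturated to the unsigned 16-bit range. */
static uint16_t narrow_i32_to_u16_sat(int32_t v) {
    if (v < 0) return 0;
    if (v > 0xFFFF) return 0xFFFF;
    return (uint16_t)v;
}

/* Models one lane of i8x16.narrow_i16x8_u: the input is read as a signed
   16-bit value and saturated to the unsigned 8-bit range. */
static uint8_t narrow_i16_to_u8_sat(int16_t v) {
    if (v < 0) return 0;
    if (v > 0xFF) return 0xFF;
    return (uint8_t)v;
}

/* Two narrows with no mask: this saturates, so it is NOT a truncation. */
uint8_t narrow_only(int32_t v) {
    return narrow_i16_to_u8_sat((int16_t)narrow_i32_to_u16_sat(v));
}

/* Assumed sequence: AND with 0xFF, narrow 32->16, then narrow 16->8.
   The 16 lanes stand for four i32x4 vectors v0..v3; narrow(a, b) keeps a's
   lanes in the low half, so narrow(narrow(v0, v1), narrow(v2, v3)) keeps
   the lanes in their original order. */
void trunc_v16i32_via_narrow(const int32_t in[16], uint8_t out[16]) {
    for (int lane = 0; lane < 16; lane++) {
        int32_t masked = (int32_t)((uint32_t)in[lane] & 0xFFu);
        uint16_t half = narrow_i32_to_u16_sat(masked);
        out[lane] = narrow_i16_to_u8_sat((int16_t)half);
    }
}

/* Reference semantics of: trunc <16 x i32> to <16 x i8>. */
void trunc_v16i32_reference(const int32_t in[16], uint8_t out[16]) {
    for (int lane = 0; lane < 16; lane++)
        out[lane] = (uint8_t)(uint32_t)in[lane];
}

/* Compares the masked narrow sequence against plain truncation on the
   squares of 16 consecutive indices, as in the loop from the report. */
void self_check(void) {
    int32_t in[16];
    uint8_t got[16], want[16];
    for (int lane = 0; lane < 16; lane++) {
        uint32_t i = 1000u + (uint32_t)lane;
        in[lane] = (int32_t)(i * i);
    }
    trunc_v16i32_via_narrow(in, got);
    trunc_v16i32_reference(in, want);
    for (int lane = 0; lane < 16; lane++)
        assert(got[lane] == want[lane]);
    /* Without the mask, 256 saturates to 255 instead of truncating to 0. */
    assert(narrow_only(256) == 255);
}
```

Under these assumptions, the 16 extract_lane/replace_lane pairs would become
four v128.and operations and three narrow operations per <16 x i32> value.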
