[PATCH] D80863: [WebAssembly] Eliminate range checks on br_tables

Paolo Severini via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Jun 8 01:02:59 PDT 2020


paolosev added a comment.

I was testing the performance of a program with a big switch statement in a loop, a very common pattern in C/C++, and I came across the needless range checks on `br_table` that this patch fixes (Sweet! :-)).
However, testing the latest version of Emscripten with this fix, I find that Clang now emits "worse" bytecode for my small test program (attached as F12106893 <https://reviews.llvm.org/F12106893>, compiled with `emcc -O3 micro-interp.c -o micro-interp.js`).
The code looks like this:

  typedef enum {
      A = 0, B, C, D, E, F, G, H, I, J, K, L
  } Op;
  int f(Op* ops, int len) {
      int result = 0;
      for (int i = 0; i < len; i++) {
          Op op = ops[i];
          switch (op) {
              case A: { ... break; }
              case B: { ... break; }
              case C: { ... break; }
              ...
              default: { ... break; }
          }
      }
      return result;
  }
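For readers without access to the attachment, here is a minimal self-contained sketch of this kind of loop. It is not the attached micro-interp.c: the function name `run_ops` and the effect of every opcode are invented for illustration, and only the overall shape (a dense 12-way switch inside a loop, plus a default case) follows the outline above.

```c
#include <stddef.h>

/* Opcode names follow the enum in the post; what each opcode does
   below is a made-up placeholder, not the original program's logic. */
typedef enum { A = 0, B, C, D, E, F, G, H, I, J, K, L } Op;

/* Interprets `len` opcodes and returns the accumulated result. */
int run_ops(const Op *ops, int len) {
    int result = 0;
    for (int i = 0; i < len; i++) {
        switch (ops[i]) {
            case A: result += 1;      break;
            case B: result -= 1;      break;
            case C: result *= 2;      break;
            case D: result += 3;      break;
            case E: result -= 3;      break;
            case F: result ^= 1;      break;
            case G: result &= 0xFF;   break;
            case H: result |= 0x10;   break;
            case I: result = -result; break;
            case J: result += i;      break;
            case K: return result;    /* hypothetical "halt" opcode */
            case L: result = 0;       break;
            default: result += 100;   break;  /* out-of-range opcode */
        }
    }
    return result;
}
```

With twelve consecutive case values, a compiler will typically lower the switch to a jump table, which is what becomes a `br_table` on the WebAssembly target.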

Before this change, the function was compiled into this Wasm code, with a single `br_table` that was not using its default label (see WAT file attached: F12107110: emcc-0-base.OLD.wat <https://reviews.llvm.org/F12107110>):

  (func (;11;) (type 2) (param i32)
    (local i32 i32 i32 i32 i32 i64)
    ...
    block  ;; label = @1
      block  ;; label = @2
        loop  ;; label = @3
          local.get 4
          i32.load16_u
          local.tee 3
          i32.const 11
          i32.gt_u
          br_if 1 (;@2;)
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      block  ;; label = @10
                        block  ;; label = @11
                          block  ;; label = @12
                            block  ;; label = @13
                              block  ;; label = @14
                                local.get 3
                                br_table 10 (;@4;) 0 (;@14;) 1 (;@13;) 2 (;@12;) 3 (;@11;) 4 (;@10;) 5 (;@9;) 6 (;@8;) 7 (;@7;) 8 (;@6;) 13 (;@1;) 9 (;@5;) 10 (;@4;)
                              end
                              ;; case 1
                            end
                            ;; case 2
                          end
                        ...

The x64 code jitted by V8 for the switch was reasonably compact, though not optimal:

  00000000BD10FA43    83  460fb73423     movzxwl r14,[rbx+r12*1]
  00000000BD10FA48    88  4183fe0b       cmpl r14,0xB 
  00000000BD10FA4C    8c  0f8741020000   ja 00000000BD10FC93  // jmp to default case
  00000000BD10FA52    92  4183ee01       subl r14,0x1
  00000000BD10FA56    96  458bf6         movl r14,r14
  00000000BD10FA59    99  4183fe0b       cmpl r14,0xB         
  00000000BD10FA5D    9d  0f830d000000   jnc 00000000BD10FA70 // jmp to br_table default label
  00000000BD10FA63    a3  4c8d1556030000 leaq r10,[rip+0x356]
  00000000BD10FA6A    aa  43ff24f2       jmp [r10+r14*8]      // br_table jump

There were two checks (each a `cmpl` followed by a conditional jump): the first for the switch default case, the second for the bounds check in the implementation of `br_table`.
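The redundancy can be sketched in C. This models only the control flow, not V8's actual code generation; the function names, the `checks` counter, and the `DEFAULT_TARGET` sentinel are hypothetical and exist purely to make the duplicate comparison visible.

```c
#include <stdint.h>

enum { TABLE_SIZE = 12, DEFAULT_TARGET = -1 };

/* Old shape: an explicit range check for the switch default,
   followed by the bounds check that br_table performs anyway.
   Returns the selected table slot, or DEFAULT_TARGET. */
int dispatch_two_checks(uint32_t op, int *checks) {
    *checks = 1;
    if (op > 11u)                 /* i32.const 11 / i32.gt_u / br_if */
        return DEFAULT_TARGET;
    *checks = 2;
    if (op >= TABLE_SIZE)         /* br_table's own bounds check; never true here */
        return DEFAULT_TARGET;
    return (int)op;
}

/* Intended shape: the br_table bounds check doubles as the
   default-case check, so a single comparison is enough. */
int dispatch_one_check(uint32_t op, int *checks) {
    *checks = 1;
    if (op >= TABLE_SIZE)
        return DEFAULT_TARGET;
    return (int)op;
}
```

Both functions select the same target for every input; the first simply performs a second comparison for in-range opcodes, and that comparison can never succeed.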

With this change, however, I see the following code being generated (see attached WAT file F12107026: emcc-0-base.NEW.wat <https://reviews.llvm.org/F12107026>):

  (func (;6;) (type 6) (param i32)
    (local i32 i32 i32 i32 i64 i32)
    ...
    block  ;; label = @1
      block  ;; label = @2
        block  ;; label = @3
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      block  ;; label = @10
                        block  ;; label = @11
                          block  ;; label = @12
                            block  ;; label = @13
                              block  ;; label = @14
                                local.get 4
                                i32.load16_u
                                br_table 1 (;@13;) 0 (;@14;) 2 (;@12;) 3 (;@11;) 4 (;@10;) 5 (;@9;) 6 (;@8;) 7 (;@7;) 8 (;@6;) 9 (;@5;) 13 (;@1;) 10 (;@4;) 12 (;@2;)
                              end
                              i32.const 0
                              local.set 3
                              br 10 (;@3;)
                            end
                            i32.const 10
                            local.set 3
                            br 9 (;@3;)
                          end
                          i32.const 1
                          local.set 3
                          br 8 (;@3;)
                        end
                        i32.const 2
                        local.set 3
                        br 7 (;@3;)
                      end
                      i32.const 3
                      local.set 3
                      br 6 (;@3;)
                    end
                    i32.const 4
                    local.set 3
                    br 5 (;@3;)
                  end
                  i32.const 5
                  local.set 3
                  br 4 (;@3;)
                end
                i32.const 6
                local.set 3
                br 3 (;@3;)
              end
              i32.const 7
              local.set 3
              br 2 (;@3;)
            end
            i32.const 8
            local.set 3
            br 1 (;@3;)
          end
          i32.const 9
          local.set 3
        end
        loop  ;; label = @3
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      block  ;; label = @10
                        block  ;; label = @11
                          block  ;; label = @12
                            block  ;; label = @13
                              block  ;; label = @14
                                block  ;; label = @15
                                  block  ;; label = @16
                                    block  ;; label = @17
                                      block  ;; label = @18
                                        block  ;; label = @19
                                          block  ;; label = @20
                                            block  ;; label = @21
                                              block  ;; label = @22
                                                block  ;; label = @23
                                                  block  ;; label = @24
                                                    block  ;; label = @25
                                                      local.get 3
                                                      br_table 0 (;@25;) 1 (;@24;) 2 (;@23;) 3 (;@22;) 4 (;@21;) 5 (;@20;) 6 (;@19;) 7 (;@18;) 8 (;@17;) 9 (;@16;) 10 (;@15;) 10 (;@15;)
                                                    end
                                                    ;; ... case 0
                                                    ...
                                                    br_table 19 (;@5;) 20 (;@4;) 10 (;@14;) 11 (;@13;) 12 (;@12;) 13 (;@11;) 14 (;@10;) 15 (;@9;) 16 (;@8;) 17 (;@7;) 23 (;@1;) 18 (;@6;) 22 (;@2;)
                                                  end
                                                  ;; ... case 1
                                                  br_table 18 (;@5;) 19 (;@4;) 9 (;@14;) 10 (;@13;) 11 (;@12;) 12 (;@11;) 13 (;@10;) 14 (;@9;) 15 (;@8;) 16 (;@7;) 22 (;@1;) 17 (;@6;) 21 (;@2;)
                                                end
                                              ...

The first `br_table` now jumps indirectly to a stub of the form `i32.const X`, `local.set 3`, `br Y`, which in turn jumps to the actual code for the case branch. Each case branch then ends with another strange `br_table`.
Unsurprisingly, the jitted native code is also much more convoluted. As a result, my small benchmark is about 34% slower (33.9 sec vs. 25.2 sec).
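To make the stub indirection concrete, the following C sketch models the two dispatch levels. The mapping table is my reading of the stub constants in the excerpt above (for example, opcode 0 branches to @13, whose stub stores 10 in local 3), so treat it as an interpretation of the listing rather than a statement about what the compiler intends. The names `first_level`, `second_level`, `EXIT_LOOP` and `DEFAULT_PATH` are hypothetical.

```c
#include <stdint.h>

enum { EXIT_LOOP = -1, DEFAULT_PATH = -2 };

/* Value each stub stores in local 3, indexed by opcode.
   Opcode 10 branches to @1, outside the loop, and has no stub. */
static const int stub_index[12] = {
    10, 0, 1, 2, 3, 4, 5, 6, 7, 8, EXIT_LOOP, 9
};

/* First br_table: opcode -> stub -> intermediate index. */
int first_level(uint32_t op) {
    if (op >= 12u)
        return DEFAULT_PATH;      /* br_table default: 12 (;@2;) */
    return stub_index[op];
}

/* Second br_table, inside the loop: dispatches on local 3.
   It has 11 entries, and its default equals its last entry. */
int second_level(uint32_t local3) {
    return local3 > 10u ? 10 : (int)local3;
}
```

In this model an in-range opcode goes through two table lookups and a store before reaching its case body, where the old code needed a single `br_table`.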

I see that this issue disappears if I comment out `addPass(createWebAssemblyFixBrTableDefaults());` in `WebAssemblyPassConfig::addInstSelector()` and reintroduce `Ops.push_back(DAG.getBasicBlock(MBBs[0]))` in `WebAssemblyTargetLowering::LowerBR_JT()`.

Maybe there is something wrong with the configuration on my machine, but I double-checked by reinstalling the latest Emscripten and can reliably reproduce the problem.
Could you take a look?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80863/new/

https://reviews.llvm.org/D80863




