<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/101725>101725</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [WebAssembly] Masked lanes in auto-vectorization
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:WebAssembly
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          yurydelendik
      </td>
    </tr>
</table>

<pre>
    While working on https://bugzilla.mozilla.org/show_bug.cgi?id=1887312 , I found shuffles that are inefficient and costly from the point of view of a WebAssembly compiler. Each of these shuffles lowers to more than 4 native instructions, which may also read a constant from memory.

An example:
```wat
    local.get 3
    local.get 3
    local.get 3
    i8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    local.tee 3
    local.get 3
    local.get 3
    i8x16.shuffle 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    i8x16.min_u
    i8x16.extract_lane_u 0
```

I think it is the result of something like:

```c
int run(unsigned char* a) {
  unsigned char m = 255;
  for (int i = 0; i < 10000; i++) if (a[i] < m) m = a[i];
  return m;
}
```

Logically, the shuffles in the example above do not need to name a particular lane (0) for the positions whose values are never used. If different lane indices were chosen instead of 0, the Wasm compiler could produce far more efficient instructions, although a given choice may only benefit specific CPUs.
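
To illustrate the point, here is a minimal scalar model in C of the reduction above. It is only a sketch, not LLVM or SpiderMonkey code: `shuffle16`, `min_u16`, and `reduce_min` are hypothetical names, and the `filler` parameter stands for the index written into the lanes that do not contribute to the result (0 in the generated code).

```c
/* Scalar model of i8x16.shuffle: indices 0..15 select from a, 16..31 from b. */
static void shuffle16(unsigned char out[16], const unsigned char a[16],
                      const unsigned char b[16], const unsigned char idx[16]) {
  unsigned char tmp[16]; /* temporary, so that out may alias a or b */
  for (int i = 0; i < 16; i++)
    tmp[i] = idx[i] < 16 ? a[idx[i]] : b[idx[i] - 16];
  for (int i = 0; i < 16; i++) out[i] = tmp[i];
}

/* Scalar model of i8x16.min_u: lane-wise unsigned minimum. */
static void min_u16(unsigned char out[16], const unsigned char a[16],
                    const unsigned char b[16]) {
  for (int i = 0; i < 16; i++) out[i] = a[i] < b[i] ? a[i] : b[i];
}

/* Models the four shuffle + min_u steps followed by extract_lane_u 0.
 * filler (assumed to be in 0..31) is the index placed in every lane
 * that later steps never read; the generated code uses 0. */
unsigned char reduce_min(const unsigned char in[16], unsigned char filler) {
  unsigned char v[16], s[16], idx[16];
  for (int i = 0; i < 16; i++) v[i] = in[i];
  for (int half = 8; half >= 1; half /= 2) {
    /* Lanes 0..half-1 take the upper half; the rest take the filler. */
    for (int i = 0; i < 16; i++)
      idx[i] = i < half ? (unsigned char)(i + half) : filler;
    shuffle16(s, v, v, idx);
    min_u16(v, v, s);
  }
  return v[0]; /* extract_lane_u 0 */
}
```

In this model `reduce_min` returns the same value for every `filler`, because after each step only lanes 0..half-1 are read by the following steps.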

I opened an issue about masked lanes https://github.com/WebAssembly/flexible-vectors/issues/66 with more examples.

In SpiderMonkey, we already match some shuffle patterns to select better CPU instructions (https://searchfox.org/mozilla-central/source/js/src/jit/ShuffleAnalysis.h), but these shuffles are hard to match.

This is more of an RFC issue, to learn about the internals of the WebAssembly target and auto-vectorization, and to ask whether it is possible to improve the generation of shuffles in Wasm when some lanes do not matter.

/cc @ppenzin @tlively

</pre>