[llvm] [WebAssembly] Mask undef shuffle lanes (PR #149084)

Wed Aug 6 05:33:06 PDT 2025

sparker-arm wrote:

I've applied #146864 on top of wasi-sdk and used libyuv as a guide, because it can vectorize in weird and wonderful ways.

By simply searching from extend_low and extmul_low operations to determine whether the high half of a shuffle is required, it's possible to get significant speedups:

```
Benchmark                                     Speedup(%)
------------------------------------------  ------------
libyuv-ARGBScaleDownBy2_Bilinear-run_times        21.418
libyuv-ARGBScaleDownBy2_Box-run_times             21.705
libyuv-ARGBScaleDownBy2_None-run_times            15.899
libyuv-ARGBScaleDownBy4_Box-run_times             22.627
libyuv-ARGBScaleDownBy4_Linear-run_times          -0.084
libyuv-ColourI420-run_times                        1.825
libyuv-ColourI422-run_times                        1.991
libyuv-ColourJ420-run_times                        2.04
libyuv-ColourJ422-run_times                        1.972
libyuv-NV12ToI420-run_times                      151.189
libyuv-NV21ToI420-run_times                      152.743
libyuv-P010ToI010-run_times                        8.704
libyuv-P012ToI012-run_times                       10.693
libyuv-UVScaleDownBy3by4_Linear-run_times          0.036
libyuv-UVScaleDownBy3by4_None-run_times           -0.2   
```
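To make the idea concrete, here is a rough sketch of the check on a toy graph. It is not the code from #146864: the `Node` class, the opcode strings, and `shuffles_with_unused_high_half` are all made up for illustration, and it assumes extend_low and extmul_low are the only users that read just the low half of their inputs.

```python
from dataclasses import dataclass, field

# Hypothetical toy IR, not LLVM's SelectionDAG. Each node has an opcode
# string and a list of operand indices into the node list.
# Assumption: these are the only ops that read just the low half.
LOW_HALF_ONLY = {"extend_low", "extmul_low"}


@dataclass
class Node:
    op: str
    operands: list = field(default_factory=list)


def shuffles_with_unused_high_half(nodes):
    """Return indices of shuffles whose every user reads only the low half."""
    # Build a user list for each node from the operand edges.
    users = {i: [] for i in range(len(nodes))}
    for i, node in enumerate(nodes):
        for operand in node.operands:
            users[operand].append(i)

    result = []
    for i, node in enumerate(nodes):
        if node.op != "shuffle":
            continue
        # The high half is dead only if the shuffle has users and all of
        # them are low-half-only operations.
        if users[i] and all(nodes[u].op in LOW_HALF_ONLY for u in users[i]):
            result.append(i)
    return result


nodes = [
    Node("load"),                # 0
    Node("shuffle", [0, 0]),     # 1: only feeds extend_low
    Node("extend_low", [1]),     # 2
    Node("shuffle", [0, 0]),     # 3: feeds extmul_low and a store
    Node("extmul_low", [3, 3]),  # 4
    Node("store", [3]),          # 5: reads the whole vector
]

print(shuffles_with_unused_high_half(nodes))
```

In this toy graph only the first shuffle qualifies, since the second one is also stored in full and so still needs its high half.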

So, I think, with the revised approach to memory interleaving, using this undef AND mask hack isn't really going to add anything, apart from extra work for the runtimes.

It would still be really nice to have this information encoded in the instruction though.

https://github.com/llvm/llvm-project/pull/149084