<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/129441>129441</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            wasm: `__builtin_reduce_and` does not optimize well
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          folkertdev
      </td>
    </tr>
</table>

<pre>
Given this C code:

https://godbolt.org/z/YMo1qqccT

```c
#include <stdbool.h>
#include <wasm_simd128.h>

bool foo(v128_t a) { return wasm_i8x16_all_true(a); }

bool bar(v128_t a) {
    v128_t zero = wasm_i8x16_splat(0);
    return __builtin_reduce_and(wasm_i8x16_ne(a, zero));
}

bool baz(v128_t a) {
    v128_t zero = wasm_i8x16_splat(0);
    return __builtin_reduce_and((a != zero));
}
```

I'd expect these all to optimize to

```asm
foo:
        local.get       0
        i8x16.all_true
        end_function
```

or some variation of it. However, the other variants optimize much worse:

```asm
bar:
        local.get       0
        v128.const      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
        i8x16.ne
        local.tee       0
        local.get       0
        local.get       0
        i8x16.shuffle   8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
        v128.and
        local.tee       0
        local.get       0
        local.get       0
        i8x16.shuffle   4, 5, 6, 7, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
        v128.and
        i32x4.extract_lane      0
        i32.const       0
        i32.ne
        end_function

baz:
        local.get       0
        v128.const      0, 0, 0, 0
        i32x4.eq
        v128.any_true
        i32.const       -1
        i32.xor
        i32.const       1
        i32.and
        end_function
```
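
To make the expectation concrete, here is a rough scalar model in plain C (not wasm-specific, and not taken from LLVM). All function names are hypothetical. The first two functions assume the reduction runs over sixteen 8-bit lanes, in which case it agrees with `i8x16.all_true`. The third models a different reading that the `bar` listing above seems to suggest (the result is ANDed as 32-bit lanes and then compared against zero); that interpretation is my assumption, and under it the results can differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BYTES 16

/* Scalar model of i8x16.all_true: every 8-bit lane is nonzero. */
static bool model_all_true_i8x16(const uint8_t a[BYTES]) {
    for (int i = 0; i < BYTES; i++)
        if (a[i] == 0) return false;
    return true;
}

/* AND reduction over the per-byte "a != 0" mask (0xFF or 0x00),
 * ASSUMING the reduction is done over sixteen 8-bit lanes. */
static bool model_reduce_and_ne_i8x16(const uint8_t a[BYTES]) {
    uint8_t acc = 0xFF;
    for (int i = 0; i < BYTES; i++)
        acc &= (a[i] != 0) ? 0xFF : 0x00;
    return acc != 0;
}

/* Same per-byte mask, but ASSUMING the reduction ANDs four 32-bit
 * lanes and the final 32-bit value is tested against zero. Byte j of
 * the result is the AND of mask bytes j, j+4, j+8 and j+12. */
static bool model_reduce_and_ne_i32x4(const uint8_t a[BYTES]) {
    uint8_t acc[4] = {0xFF, 0xFF, 0xFF, 0xFF};
    for (int i = 0; i < BYTES; i++)
        acc[i % 4] &= (a[i] != 0) ? 0xFF : 0x00;
    return (acc[0] | acc[1] | acc[2] | acc[3]) != 0;
}

/* The 8-bit-lane reduction should always agree with all_true. */
static void check_i8x16_models_agree(const uint8_t a[BYTES]) {
    assert(model_all_true_i8x16(a) == model_reduce_and_ne_i8x16(a));
}
```

Calling `check_i8x16_models_agree` on any 16-byte input should pass. An input with a single zero byte is enough to make the 32-bit-lane model disagree with the other two, so the lane width the builtin actually reduces over matters when comparing against `i8x16.all_true`.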

Binary size is especially important for wasm, and it looks like `__builtin_reduce_and` just does not optimize well (I suspect the same is true for `__builtin_reduce_or`).

s390x has the same limitation (https://github.com/llvm/llvm-project/issues/129434), so maybe some work can be shared between backends?

This came up while working on the Rust standard library, where we would rather use the generic implementation of an operation than a target-specific one.

</pre>