<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/129441>129441</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            wasm: `__builtin_reduce_and` does not optimize well

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          folkertdev

      </td>

    </tr>

</table>

<pre>

Given this C code:

https://godbolt.org/z/YMo1qqccT

```c
#include <stdbool.h>
#include <wasm_simd128.h>

bool foo(v128_t a) { return wasm_i8x16_all_true(a); }

bool bar(v128_t a) {
    v128_t zero = wasm_i8x16_splat(0);
    return __builtin_reduce_and(wasm_i8x16_ne(a, zero));
}

bool baz(v128_t a) {
    v128_t zero = wasm_i8x16_splat(0);
    return __builtin_reduce_and(a != zero);
}
```

I'd expect these all to optimize to

```asm
foo:
        local.get       0
        i8x16.all_true
        end_function
```

or some variation of it. However, the other variants optimize much worse:

```asm
bar:
        local.get       0
        v128.const      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
        i8x16.ne
        local.tee       0
        local.get       0
        local.get       0
        i8x16.shuffle   8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 0, 1, 2, 3
        v128.and
        local.tee       0
        local.get       0
        local.get       0
        i8x16.shuffle   4, 5, 6, 7, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
        v128.and
        i32x4.extract_lane      0
        i32.const       0
        i32.ne
        end_function
baz:
        local.get       0
        v128.const      0, 0, 0, 0
        i32x4.eq
        v128.any_true
        i32.const       -1
        i32.xor
        i32.const       1
        i32.and
        end_function
```
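
For reference, here is a minimal scalar sketch of the equivalence the expectation above relies on. It is not taken from clang or `wasm_simd128.h`; the function names are hypothetical, and the vector is modelled as a plain array of sixteen bytes. Note the assumption: the model reduces over sixteen 8-bit lanes, whereas `v128_t` itself is declared in `wasm_simd128.h` as a vector of four 32-bit integers, so this describes the intended reading rather than what the builtin necessarily computes on that type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of i8x16.all_true: true iff every byte lane is nonzero. */
static bool all_true_i8x16(const uint8_t lanes[16]) {
    for (size_t i = 0; i < 16; i++) {
        if (lanes[i] == 0) {
            return false;
        }
    }
    return true;
}

/* Scalar model of "compare each byte lane != 0, then AND-reduce the masks".
 * Assumption: the reduction runs over 8-bit lanes (hypothetical helper). */
static bool ne_zero_then_reduce_and_i8x16(const uint8_t lanes[16]) {
    uint8_t acc = 0xFF;
    for (size_t i = 0; i < 16; i++) {
        uint8_t mask = (lanes[i] != 0) ? 0xFF : 0x00; /* lane-wise ne */
        acc &= mask;                                  /* AND reduction */
    }
    return acc != 0;
}

/* Asserts that both models give the same answer for one input. */
static void check_models_agree(const uint8_t lanes[16]) {
    assert(all_true_i8x16(lanes) == ne_zero_then_reduce_and_i8x16(lanes));
}
```

Under that 8-bit-lane assumption the two models agree on every input, which is why a single `i8x16.all_true` would be a valid lowering for the pattern.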

Binary size is especially important for wasm, and it looks like `__builtin_reduce_and` just does not optimize well (I suspect the same is true for `__builtin_reduce_or`).

s390x has the same limitation (https://github.com/llvm/llvm-project/issues/129434), so maybe some work can be shared between backends?

This came up while working on the Rust standard library, which would rather use the generic implementation of an operation than a target-specific one.

</pre>
