<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/98908>98908</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [NVPTX] Optimization causes threads to erroneously enter if-block
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          psalz
      </td>
    </tr>
</table>

<pre>
    When implementing an optimized atomic counter using warp-level primitives (as e.g. described [here](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/#opportunistic_warp-level_programming)), LLVM appears to erroneously enter a branch with every thread in a warp, even though only a subset of threads should be entering it.

I've reproduced the bug with Clang 14, Clang 17 as well as a recent manual build (59e56ee). The bug appears only when optimizations are enabled with `-O1` or higher. The program works correctly using `nvcc`.

Full reproducer:

```cuda
#include <cuda_runtime.h>

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

#define CUDA_CHECK(fn, ...) \
        { \
                cudaError_t status = fn(__VA_ARGS__); \
                if(status != cudaSuccess) { \
                        fprintf(stderr, "CUDA Error in call %s on line %d: %s\n", #fn, __LINE__, cudaGetErrorString(status)); \
                        abort(); \
                } \
        }

__global__ static void kernel(const int* input, uint32_t* counters) {
        const uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
        int data = input[idx];
        uint32_t* counter = &counters[data];

        auto active_group = __activemask();
        auto mask = __match_any_sync(active_group, data);

        auto leader = __ffs(mask) - 1;
        /* volatile */ uint32_t res; // making res volatile fixes the problem
        auto laneid = threadIdx.x % 32;
        if(laneid == leader) { // this branch is entered by every thread
                res = atomicAdd(counter, __popc(mask));
                printf("i am %d (tid=%d), leader is %u\n", laneid, threadIdx.x, leader);
        }
        res = __shfl_sync(mask, res, leader); // removing the warp shuffle altogether also fixes it
}

int main() {
        constexpr size_t buffer_size = 32;
        constexpr size_t input_size = 32;
        std::vector<int> input_data(input_size);
        int* input_d;
        CUDA_CHECK(cudaMalloc, &input_d, input_data.size() * sizeof(int));
        const auto max_values_per_slot = (input_data.size() + buffer_size - 1) / buffer_size;
        for(size_t i = 0; i < input_data.size(); ++i) {
                input_data[i] = i / 2;
        }

        uint32_t* counters_d;
        CUDA_CHECK(cudaMalloc, &counters_d, buffer_size * sizeof(uint32_t));
        CUDA_CHECK(cudaMemcpy, input_d, input_data.data(), input_data.size() * sizeof(int), cudaMemcpyHostToDevice);

        const unsigned block_size = 32;
        const unsigned grid_size = (input_size + block_size - 1) / block_size;
        kernel<<<grid_size, block_size>>>(input_d, counters_d);

        CUDA_CHECK(cudaFree, input_d);
        CUDA_CHECK(cudaFree, counters_d);

        return 0;
}
```

The program initializes a buffer of size 32 with 16 consecutive pairs of values (`0, 0, 1, 1, 2, 2, ...`). It then uses `__match_any_sync` to group threads that have the same value (i.e., 0 and 1, 2 and 3, and so on). Only the first thread in each group should then enter the block that does the `atomicAdd`. However, this is the output:

```
i am 0 (tid=0), leader is 0
i am 1 (tid=1), leader is 0
i am 2 (tid=2), leader is 2
i am 3 (tid=3), leader is 2
i am 4 (tid=4), leader is 4
i am 5 (tid=5), leader is 4
...
```

What it should look like:

```
i am 0 (tid=0), leader is 0
i am 2 (tid=2), leader is 2
i am 4 (tid=4), leader is 4
...
```

Since either marking `res` as volatile or removing the call to `__shfl_sync` altogether fixes the issue, I suspect that LLVM thinks it is cheaper to compute `res` in each thread instead of calling the warp collective - somehow disregarding the fact that both `atomicAdd` and `printf` aren't side-effect free..?
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJysON2OozjWT-O6OUpETEglF7lIJV0zrW9m-tN2b8_eIQOH4C1jI9ukqvrpV8cmgVS6ameklSIC5vz_H4Rz8qgRtyx7YNnhTvS-MXbbOaF-3BWmet3-2aAG2XYKW9Re6iMIDabzspU_sALhTStLKE2vPVroHUE8C9vNFJ5QQWdlK708oQPG18IBzo9zqNCVVhZYAcseGrTIsgPj68b7zrF0x_gj448VETAd2rk-yUqKeWlaxh8LZY6MPwZOs7KvxGxkNxvZBRqp6Tpjfa-l87LMR8C8s-ZoRdtKfWR8E357-O2377-D6DoU1oE3gNYajaZ36hUw6CegsEKXDTxL3wCe0L6CbyyKCqQGETQnSnhCDb4x_bEBo9UrCHB94dCDqQcEB64xvaqgwEicLCf9nCUHluzi9TPj9ycEi501VV9iBb5BKPpj5L9XQh9hsSSOw_09CAfPqBT9C7BYovbQCt0LBUUvVUV-yDaYrRAZ38zh20DxrHcQ95m8PnhZeGm0A2FJTlEorCJ3tkpmXxZslYCx0MhjgzZSG2wLz8Y-OSiNtVh69ToEB1sl-lSWbJVcqfrYKzUqaikKJm_ZKok_cvhwxFOpS9VXCCzd03lue-1li_OGpZ-usK9BhXNo_Qhz_db5SuoP35p3Xp6w9Mbe8q6wlhph_8_DLt__-mn_f4yva01Om8_njG-AZfszyobdP1w_JxvS7ZO1xuYenBe-d8DSAxCJdZ5_3-W7f_zyNc8piNMbZFkzvj5j8QUhEr2vfVmic4H7LUeWbOrOSu0jcoXWkriMc1ICgjAU8KVQChjPKGpAkZKMZxVLd-GQZXvNOI-YadQ4z3_7_McnEnYf5PgFfaD21duQioOoQ07eqsOSjSiM9Yyv3wG4P7yx52Hqjjw_KlMIlefBlLKEk5EVPKHVqBhfl0Y7DxQAfAdSd70nSXupfcrzcDiUurPpLowi5hkSZPUSvFQoUz59rl7mL0DY4fEg2_D4MJSC8JqlIy2pPVTCi0AhSpE9yOqFyuQE7CdiBQzGVxcpswcidI04oIveGxAllcv8aE3fBeQ8j0etcE8XK1_j0KsBthW-bHKhX3P3qksq8RN6ZLrA_YrGlJJCUQ1C53ldO8bXke8GZrCYMg71fAcno4SXigKN2sRob4suRENoHdCKJ6o1Ft2IUcsXdKGAdtYUCts3ogiNsgqiTLxCgQwpv3IOJcUITQhRjUsyRRl8I925XUgXazxWULxe9Y1p7JK4RC421V1VhYAMjoy505muHE30xjUs2VxylnEuQbQhHange1mRqJScsdUNhpdUFLJ-kqlRMbqbxuYF4y3PMbsu0ue5a2p1joco6z745w2Vs6EstuZE7iLfUAcF1_R1rRCE8uaIvqHOq5wZXCj9wPM6tWVodFLHoL3NTXzpLDj5A3MPRV_XaHN6CjJfe_gGPKTge9DOU8Vj6e5c__exfwxYMQHWI4k3JpzWmryavrlqGFQsfxdKmTLW09UZge8njOaRQTQA3wX5TR3Y-9uIiSVrSOmX_CRUjy7vyDDK-KGUrN-j_nBlRUrXcPw4PZ5yq42l-j5YNFBPKAro9h0dYpA8MP4g37o0WO5i3-xBsuwQy2WQgb8Tpe9XTvd3jD_B4fvraJpafWTz1vS35LEtu9eJN984dgijIX3_hsdjn43kfzXOfzMHPMkSf1qUhyamw1ZQxW71YZaMsEcrqxH0KuJjsIy0prFyOZ2SHtpxuo-_C-lg7AnGp-F3CdKg7sQ3P1HxxvKPFvHK7h876gz-X9hY9L3VIcLfVqvzLDvFmM7NUksvhZI_kEb4GFy0NwTbpTxO34sVkAOw7KnfQickze41xCQm-7NVkpCg4bK4XPjlQtPnKglLwGdPxVdD7wh3ldw09lVCC1GcEs77i2-Eh0acMBRuJ1qM3IPz5zgPzEHoamAcblO6pRtnwOjA_AttHESiltb5yT6FomwGnsOuFISMq1jYhCgWohyVGZo7WyVj_1wlc_jVPFPDjS1NOup5BGd6T3PVO3vG0FKohyZjA01uuucUcDECLj4E5CMgvwHkE8B0BEw_BFyOgMsbwOUEMBsBs_cAKS7ej9Q_ydrSnz2ijHkCJZ_wf2nJv2ygv6L3x-p8lbpEQBmGjFbYp2FHpXllldAWfRkgjb0eVcL2403Ml3HmIaxxbhmnTulcHyrHZ3C967D0MXLDVwffSP1Eow0JXjYoOgpxA6Vpu97jRKJzXlzSxHn6N3WQ52qMKo1SGOZxmIEzLTbmGSrpLB6Frc6gtThLUpi41k_zJ-QqWyXDaEkHFjXj9x6crHCGdU2a1BZxPmfp4121TatNuhF3uF3c80WS8Sxb3jVbvtpgtqzL-6zMshVfpeuyXFebainSChNc3cktT_gyuV9ki2WyWSznRZlusmK9KDfrtFhixpYJtkKquVKndm7s8S6YdLtZb5L1nRIFKhc-ZHGu8flsb86yw53dEs6s6I-OLRMlnXcjFS-9Cl_A_vj-_9_-RXPEl8mXDyhFqIqXqvez70KynoVadNdbtb3-jHWUvumL4eMV8Rz-Zp01_8bSM_4YJHWMP0ZNTlv-nwAAAP__1LHxtQ">