<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/55455>55455</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            clang: -ftime-trace output does not include device compilation
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Maetveis
      </td>
    </tr>
</table>

<pre>
    As the title says, it would be nice if, when using offloading targets (like CUDA / HIP / OpenMP), the final time trace included the traces from the device-side compilations.

These files are already being generated (because the spawned processes inherit the `-ftime-trace` flag), but they are not merged into the final trace. In the case of CUDA they can be found under `/tmp/<file>-<hash>/<file>-<arch>.json`; under HIP they seem to be overwritten, as they use the same filename as the host compilation.

Having clang merge these files would improve usability with tools like [ninjatracing](https://github.com/nico/ninjatracing) (with the `--embed-time-trace` option), which are not aware of the multiple compilation passes involved when building CUDA/HIP code.

To reproduce, compile the following file with `clang main.cu --offload-arch=sm_35 --offload-arch=sm_80 -ftime-trace`.

```cuda
#ifdef __CUDA_ARCH__
template <unsigned int N>
struct SlowToCompile {
    constexpr static unsigned int value = N + SlowToCompile<N - 1>::value;
};

template <>
struct SlowToCompile<0> {
    constexpr static unsigned int value = 0;
};

#endif

void __global__ kernel(unsigned int* dst) {
#ifdef __CUDA_ARCH__
    *dst = SlowToCompile<1024>::value;
#endif
}

int main(int argc, char** argv) {
    if(argc < 0) {
        kernel<<< dim3(1), dim3(1) >>>(nullptr);
    }
}

```

I would like to help fix this and would like to hear your opinions on my proposed approach:
Add an option to the `-cc1` command line to merge the time trace output into an already existing file. The offloading drivers could then add this option to the device compilation jobs when time trace is enabled.
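
To illustrate what the merged result could look like, here is a minimal Python sketch that combines already-generated trace files outside the compiler. It is not the proposed `-cc1` option and not how clang does anything today. It assumes each file is a JSON object with a `traceEvents` list whose `ts` values are relative to an optional `beginningOfTime` field; the helper names `load_trace` and `merge_time_traces` and the per-trace `pid` numbering are hypothetical.

```python
import json


def load_trace(path):
    """Read one time-trace JSON file from disk."""
    with open(path) as f:
        return json.load(f)


def merge_time_traces(traces):
    """Merge (label, trace) pairs into one trace; the first pair is the time reference.

    Each input trace gets its own pid (an arbitrary choice for this sketch) so
    that a trace viewer shows host and device compilations as separate rows.
    """
    base = traces[0][1].get("beginningOfTime")
    events = []
    for pid, (label, trace) in enumerate(traces):
        # Shift timestamps only when both traces record when they started.
        offset = 0
        if base is not None and "beginningOfTime" in trace:
            offset = trace["beginningOfTime"] - base
        # Metadata event that names the row after the compilation it came from.
        events.append({"ph": "M", "name": "process_name", "pid": pid,
                       "tid": 0, "args": {"name": label}})
        for event in trace.get("traceEvents", []):
            shifted = dict(event, pid=pid)  # copy, do not modify the input
            if "ts" in shifted:
                shifted["ts"] += offset
            events.append(shifted)
    merged = {"traceEvents": events}
    if base is not None:
        merged["beginningOfTime"] = base
    return merged
```

It would be called as `merge_time_traces([("host", load_trace(host_json)), ("sm_80", load_trace(device_json))])`, where `host_json` and `device_json` are placeholders for the paths of the host and device trace files, and the result written back with `json.dump`.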
</pre>