<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/132204>132204</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[clang] [CUDA] No --no-cuda-include-sass option available to include only PTX code in fatbin
</td>
</tr>
<tr>
<th>Labels</th>
<td>
clang
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
mkuron
</td>
</tr>
</table>
<pre>
Commit dde3dc27ee71f12eb145ce54158779ab4ddc38ed added the options `--no-cuda-include-ptx=sm_XX`/`--cuda-include-ptx=sm_XX` to control whether the PTX assembly code is embedded into the CUDA fatbinary. The default used to be to include PTX for all architectures; since Clang 19 (#84367) it is included for none.
Sometimes it is also desirable to control whether the SASS executable code is embedded, so there should be corresponding `--no-cuda-include-sass=sm_XX`/`--cuda-include-sass=sm_XX` options. They should default to including SASS for all architectures, for consistency with the current behavior. An error should be raised when an architecture is enabled via `--cuda-gpu-arch` but the other flags specify that it should be embedded neither as PTX nor as SASS. When the flag to not include SASS is specified for an architecture, the invocation of `ptxas` for that architecture should be skipped, and the call to `fatbinary` should omit the SASS `--image` flag for that architecture.
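The requested behavior can be sketched roughly as follows. This is only an illustration in Python, not Clang's actual driver code: the function name `plan_fatbin`, its parameters, and the file names are hypothetical, and the `--image=profile=...,file=...` spelling is an assumption about how the driver invokes `fatbinary`. The two sets are meant to hold the architectures left over after the `--[no-]cuda-include-ptx` and proposed `--[no-]cuda-include-sass` flags have been resolved.

```python
# Sketch only: models the proposed driver behavior, not Clang's real code.
# All names here are hypothetical; the fatbinary argument format is assumed.

def plan_fatbin(gpu_archs, include_ptx=(), no_sass=()):
    """Return (ptxas_archs, fatbinary_image_args).

    gpu_archs:   archs enabled via --cuda-gpu-arch, e.g. ["sm_80", "sm_90"]
    include_ptx: archs whose PTX is embedded (default: none, as in Clang 19)
    no_sass:     archs named by the proposed --no-cuda-include-sass
    """
    ptxas_archs = []
    image_args = []
    for arch in gpu_archs:
        want_ptx = arch in include_ptx
        want_sass = arch not in no_sass
        if not (want_ptx or want_sass):
            # Enabled architecture that would end up with no code at all.
            raise ValueError(
                f"{arch} is enabled but embedded neither as PTX nor as SASS"
            )
        if want_sass:
            # Only architectures that embed SASS need a ptxas run.
            ptxas_archs.append(arch)
            image_args.append(f"--image=profile={arch},file={arch}.cubin")
        if want_ptx:
            virtual = arch.replace("sm_", "compute_")
            image_args.append(f"--image=profile={virtual},file={arch}.s")
    return ptxas_archs, image_args
```

For example, `plan_fatbin(["sm_80", "sm_90"], include_ptx={"sm_90"}, no_sass={"sm_90"})` would run `ptxas` only for `sm_80` and pass one SASS image for `sm_80` plus one PTX image for `compute_90`, while `plan_fatbin(["sm_90"], no_sass={"sm_90"})` would be rejected with an error.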
There are two reasons that come to mind why someone might want to embed PTX but not SASS:
1) It saves time during the build process, as `ptxas` can at times be painfully slow; this is particularly relevant when the code includes a large number of CUDA kernels but only a subset of them is executed at runtime.
2) If one is, for whatever reason, forced to build with an older version of the CUDA SDK that doesn't support the hardware architecture the executable is supposed to run on (e.g. CUDA 12.2 when running on the latest Blackwell GPUs), Clang currently forces you to embed SASS code for Hopper, even though the Blackwell GPU will only use the Hopper PTX code.
`nvcc`, for comparison, already gives you the kind of full control that I'm requesting here, see https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#generate-code-specification-gencode.
</pre>