linuxlonelyeagle wrote: The above example reflects the situation described in the documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/#async-warpgroup-smem-layout-32b-k https://github.com/llvm/llvm-project/pull/152160