arsenm added a comment.

A more correct way to optimize this would be to have a CallGraphSCC pass that propagates the uniform-work-group-size attribute to callees that are reachable only from kernels with uniform-work-group-size.

https://reviews.llvm.org/D50200
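
To make the suggested rule concrete, here is a minimal sketch on a toy call graph. It is plain Python, not LLVM API, and it is not the patch's implementation: it swaps the SCC traversal for two ordinary reachability searches, which gives the same answer when every caller is visible. The names `reachable`, `propagate_uniform`, and the example function names are hypothetical, and the sketch assumes there are no indirect calls or external callers.

```python
from collections import deque


def reachable(call_graph, roots):
    """Return every function reachable from `roots`, including the roots."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def propagate_uniform(call_graph, kernels, uniform_kernels):
    """Return the non-kernel functions that may inherit the attribute.

    A callee qualifies only if it is reachable from at least one kernel
    that has the attribute and from no kernel that lacks it.
    Assumption: `call_graph` lists every call edge in the module.
    """
    uniform_roots = set(kernels) & set(uniform_kernels)
    other_roots = set(kernels) - uniform_roots
    from_uniform = reachable(call_graph, uniform_roots)
    from_other = reachable(call_graph, other_roots)
    return from_uniform - from_other - set(kernels)
```

For example, with `k_uniform` calling `helper_a` and `shared`, `k_plain` calling `shared`, and `helper_a` and `leaf` calling each other, only `helper_a` and `leaf` qualify. `shared` does not, because it is also reachable from a kernel without the attribute.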