Artem-B wrote: Indeed. According to https://arxiv.org/pdf/2208.11174, `BFI` is much more expensive than `PRMT`, which appears to take just 1 cycle on A100: https://github.com/llvm/llvm-project/pull/110766
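
For reference, here is a rough host-side sketch of why the two instructions are interchangeable when the inserted field is byte-aligned. `bfi_model` and `prmt_model` are hypothetical names. They are simplified models written from the PTX ISA descriptions of `bfi.b32` and `prmt.b32` (default mode), not code from the PR, and they say nothing about latency.

```cpp
#include <cstdint>

// Simplified model of PTX `bfi.b32`: insert the low `len` bits of `a` into `b`
// starting at bit `pos`. Assumes pos < 32 and pos + len <= 32.
constexpr std::uint32_t bfi_model(std::uint32_t a, std::uint32_t b,
                                  unsigned pos, unsigned len) {
  const std::uint64_t field = (std::uint64_t{1} << len) - 1;  // `len` one-bits
  const std::uint32_t mask = static_cast<std::uint32_t>(field << pos);
  return (b & ~mask) | ((a << pos) & mask);
}

// Simplified model of PTX `prmt.b32` in default mode: bytes 0-3 come from `a`,
// bytes 4-7 from `b`; each selector nibble picks one source byte, and bit 3 of
// the nibble replicates that byte's sign bit instead of copying the byte.
constexpr std::uint32_t prmt_model(std::uint32_t a, std::uint32_t b,
                                   std::uint32_t sel) {
  const std::uint64_t src = (std::uint64_t{b} << 32) | a;
  std::uint32_t result = 0;
  for (int i = 0; i < 4; ++i) {
    const std::uint32_t nibble = (sel >> (4 * i)) & 0xF;
    std::uint32_t byte =
        static_cast<std::uint32_t>(src >> (8 * (nibble & 7))) & 0xFF;
    if (nibble & 8) byte = (byte & 0x80) ? 0xFF : 0x00;
    result |= byte << (8 * i);
  }
  return result;
}

// Inserting the low byte of 0xAB into byte 1 of 0x11223344, both ways.
// Selector 0x3240 keeps bytes 0, 2, 3 of the first operand and takes byte 1
// from byte 0 of the second operand (index 4).
static_assert(bfi_model(0xAB, 0x11223344, 8, 8) == 0x1122AB44u, "bfi");
static_assert(prmt_model(0x11223344, 0xAB, 0x3240) == 0x1122AB44u, "prmt");
```

The equivalence only holds for byte-aligned positions and lengths that are multiples of 8. Anything finer-grained still needs `BFI` or a shift-and-mask sequence.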