[llvm] [X86] Attempt to use VPMADD52L/VPMULUDQ instead of VPMULLQ on slow VPMULLQ targets (or when VPMULLQ is unavailable) (PR #171760)

Sun Feb 22 20:59:55 PST 2026

mahesh-attarde wrote:

> > This pull request introduces a new tuning flag "TuningSlowPMULLQ" and uses it to optimize 64-bit vector multiplication on Intel targets where "VPMULLQ" is slow.
> > On recent Intel microarchitectures, the "VPMULLQ" instruction has a high latency of 15 cycles. In contrast, the "VPMADD52LUQ" instruction (available via AVX512IFMA) performs a similar operation with a latency of only 4 cycles.
> > Reference data from uops.info (Ice Lake): "VPMULLQ": Latency 15, TP 1.5; "VPMADD52LUQ": Latency 4, TP 0.5
> > @RKSimon FIX [#158854](https://github.com/llvm/llvm-project/issues/158854)
> > Sorry it took so long.
> 
> @houngkoungting Can you help explain the equivalence between the instructions mentioned? `VPMULLQ` is a 64-bit calculation on signed inputs, while `VPMADD52LUQ` operates on 52-bit unsigned inputs.

I see you have treated the upper 12 bits as zero for the transform:
https://github.com/llvm/llvm-project/pull/156714/changes#diff-eb2f176d67cdf1955a90e71e25d6d39910d723d4e0b8a9bf8dfa229d3a6b2c1eR57973
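
For reference, a minimal scalar sketch of how the two instructions compare on a single 64-bit lane. This is only a model under stated assumptions: the helper names (`mullq_lane`, `madd52lo_lane`, `product_fits_52`) are made up for illustration and are not LLVM APIs, and it does not claim to reproduce the check used in the PR. It suggests that with a zero accumulator the results agree whenever the full product fits in 52 bits, and that zero upper 12 bits in each operand would not be sufficient on their own.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 64-bit lane. All names here are illustrative only.
constexpr std::uint64_t kMask52 = (std::uint64_t{1} << 52) - 1;

// VPMULLQ: low 64 bits of the product. These bits are identical whether the
// inputs are interpreted as signed or unsigned.
constexpr std::uint64_t mullq_lane(std::uint64_t a, std::uint64_t b) {
  return a * b;  // unsigned arithmetic wraps modulo 2^64
}

// VPMADD52LUQ: acc + low 52 bits of (a[51:0] * b[51:0]).
constexpr std::uint64_t madd52lo_lane(std::uint64_t acc, std::uint64_t a,
                                      std::uint64_t b) {
  // 2^52 divides 2^64, so the low 52 bits survive the 64-bit wrap.
  return acc + (((a & kMask52) * (b & kMask52)) & kMask52);
}

// True when the exact (unwrapped) unsigned product is below 2^52.
constexpr bool product_fits_52(std::uint64_t a, std::uint64_t b) {
  return a == 0 || b <= kMask52 / a;
}

inline void check_examples() {
  // Small product: both forms agree when the accumulator is zero.
  const std::uint64_t x = (std::uint64_t{1} << 26) - 1;
  assert(product_fits_52(x, x));
  assert(madd52lo_lane(0, x, x) == mullq_lane(x, x));

  // Upper 12 bits of each operand are zero, but the product needs 53 bits,
  // so the low-52 form drops the top bit.
  const std::uint64_t big = std::uint64_t{1} << 51;
  assert(!product_fits_52(big, 2));
  assert(madd52lo_lane(0, big, 2) != mullq_lane(big, 2));
}
```

In this model, the guard for swapping one form for the other is a bound on the product (for example, the significant bits of both operands summing to at most 52), not just a bound on each operand.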

https://github.com/llvm/llvm-project/pull/171760