[llvm] [AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (PR #130577)

via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 12 19:53:27 PDT 2025


================
@@ -65,8 +65,12 @@ define <4 x i16> @and_mulhuw_v4i16(<4 x i64> %a, <4 x i64> %b) {
 ;
 ; AVX512-LABEL: and_mulhuw_v4i16:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vpmulhuw %ymm1, %ymm0, %ymm0
-; AVX512-NEXT:    vpmovqw %zmm0, %xmm0
+; AVX512-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vpmovqd %zmm0, %ymm0
+; AVX512-NEXT:    vpmovqd %zmm1, %ymm1
+; AVX512-NEXT:    vpmulhuw %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
----------------
Shoreshen wrote:

Hi @nikic, I've read through the x86 backend and there is a pmulh optimization as follows:
![image](https://github.com/user-attachments/assets/c2639355-6c70-4744-af9c-aca1d183ae0c)
The conditions between the two functions are a bit different, but generally:
1. the pattern is `lshr(mul a, b)` or `trunc(lshr(mul a, b))`, or the same pattern with an arithmetic shift
2. the target is the x86 backend and `Subtarget.hasSSE2()` is true
3. the type is a vector and the element bit size is > 32
4. the shift-right amount is 16 bits
5. the operands of the mul only have valid bits in the lower 16 bits
When these hold, it replaces the original mul with a narrower mulhu/mulhs instruction (roughly sketched below).
By narrowing the 64-bit mul to 32 bits we break condition 3, so this optimization is blocked during DAG selection.
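For reference, here's a rough IR-level sketch of those conditions using LLVM's `PatternMatch` helpers. The real x86 combine lives in X86ISelLowering.cpp and operates on SelectionDAG nodes, so the function name and exact checks below are only an illustration of the shape of the pattern, not the actual implementation:

```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Condition 2 (x86 backend with SSE2) would be checked by the caller; this
// only covers the shape of the pattern (conditions 1, 3, 4, 5).
static bool looksLikePMULHCandidate(Value *V) {
  // Condition 1: lshr/ashr of a mul, possibly behind a trunc.
  Value *Shift = V;
  (void)match(V, m_Trunc(m_Value(Shift)));

  Value *A, *B;
  // Condition 4: shift right by 16.
  if (!match(Shift, m_Shr(m_Mul(m_Value(A), m_Value(B)), m_SpecificInt(16))))
    return false;

  // Condition 3: vector type with element width > 32 bits.
  auto *VT = dyn_cast<VectorType>(Shift->getType());
  if (!VT || VT->getScalarSizeInBits() <= 32)
    return false;

  // Condition 5: the mul operands only carry valid bits in the low 16 bits,
  // e.g. they are zero/sign extensions of <= 16-bit elements.
  auto OnlyLow16 = [](Value *Op) {
    Value *Src = nullptr;
    return (match(Op, m_ZExt(m_Value(Src))) ||
            match(Op, m_SExt(m_Value(Src)))) &&
           Src->getType()->getScalarSizeInBits() <= 16;
  };
  return OnlyLow16(A) && OnlyLow16(B);
}
```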

What I'm thinking is that I should probably move this optimization into amdgpu-codegenprepare. I could block it for the x86 backend, but I don't think gating a specific backend in generic LLVM code is the correct approach.

https://github.com/llvm/llvm-project/pull/130577

