[llvm] [AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (PR #130577)
via llvm-commits
llvm-commits at lists.llvm.org
Wed Mar 12 19:53:27 PDT 2025
================
@@ -65,8 +65,12 @@ define <4 x i16> @and_mulhuw_v4i16(<4 x i64> %a, <4 x i64> %b) {
;
; AVX512-LABEL: and_mulhuw_v4i16:
; AVX512: # %bb.0:
-; AVX512-NEXT: vpmulhuw %ymm1, %ymm0, %ymm0
-; AVX512-NEXT: vpmovqw %zmm0, %xmm0
+; AVX512-NEXT: # kill: def $ymm1 killed $ymm1 def $zmm1
+; AVX512-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT: vpmovqd %zmm0, %ymm0
+; AVX512-NEXT: vpmovqd %zmm1, %ymm1
+; AVX512-NEXT: vpmulhuw %xmm1, %xmm0, %xmm0
+; AVX512-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
----------------
Shoreshen wrote:
Hi @nikic , I've been read through the x86 backend and there is a pmulh optimization as follow:

The conditions between the two functions are a bit different, but generally:
1. the pattern is `lshr(mul a, b)` or `trunc(lshr(mul a, b))`, or the same pattern with an arithmetic shift
2. the target is the x86 backend and `Subtarget.hasSSE2()` is true
3. the type is a vector and the element bit size is > 32
4. the shift amount is 16 bits
5. the operands of the mul only have valid bits in the lower 16 bits

When all of these hold, it replaces the original mul with a narrower mulhu/mulhs instruction, roughly as in the IR sketch below.
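
For concreteness, this is roughly the kind of IR the combine matches. This is a sketch modeled on the `and_mulhuw_v4i16` test in the diff above, not its exact body:

```llvm
define <4 x i16> @and_mulhuw_v4i16_sketch(<4 x i64> %a, <4 x i64> %b) {
  ; condition 5: both operands only have valid bits in the low 16 bits
  %a16 = and <4 x i64> %a, <i64 65535, i64 65535, i64 65535, i64 65535>
  %b16 = and <4 x i64> %b, <i64 65535, i64 65535, i64 65535, i64 65535>
  ; condition 3: the mul is on a vector with 64-bit elements
  %mul = mul <4 x i64> %a16, %b16
  ; conditions 1 and 4: trunc(lshr(mul a, b)) with a shift amount of 16
  %hi  = lshr <4 x i64> %mul, <i64 16, i64 16, i64 16, i64 16>
  %res = trunc <4 x i64> %hi to <4 x i16>
  ret <4 x i16> %res
}
```

With SSE2 available, the backend can select this whole pattern as an unsigned high-half multiply (vpmulhuw in the AVX512 check lines above).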
By narrowing the 64-bit mul to 32 bits, we break condition 3, so this optimization is blocked during DAG selection.
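
A rough sketch of the same computation after narrowing (my assumption of the narrowed form, not taken from the PR):

```llvm
; The mul now operates on 32-bit elements, so condition 3 above no longer
; holds and the DAG-level pmulh combine does not fire.
define <4 x i16> @and_mulhuw_v4i16_narrowed(<4 x i64> %a, <4 x i64> %b) {
  %a16 = and <4 x i64> %a, <i64 65535, i64 65535, i64 65535, i64 65535>
  %b16 = and <4 x i64> %b, <i64 65535, i64 65535, i64 65535, i64 65535>
  ; the 64-bit mul is rewritten through 32-bit values (the product of two
  ; 16-bit operands still fits in 32 bits, so this is value-preserving)
  %a32 = trunc <4 x i64> %a16 to <4 x i32>
  %b32 = trunc <4 x i64> %b16 to <4 x i32>
  %mul = mul <4 x i32> %a32, %b32
  %ext = zext <4 x i32> %mul to <4 x i64>
  %hi  = lshr <4 x i64> %ext, <i64 16, i64 16, i64 16, i64 16>
  %res = trunc <4 x i64> %hi to <4 x i16>
  ret <4 x i16> %res
}
```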
What I'm thinking is that I should probably move this optimization into amdgpu-codegenprepare. I could also special-case the x86 backend, but I don't think checking for a specific backend is the correct approach in generic LLVM code.
https://github.com/llvm/llvm-project/pull/130577