[llvm] [AMDGPU][CodeGenPrepare] Narrow 64 bit math to 32 bit if profitable (PR #130577)
via llvm-commits
llvm-commits at lists.llvm.org
Wed Mar 12 09:10:57 PDT 2025
================
@@ -65,8 +65,12 @@ define <4 x i16> @and_mulhuw_v4i16(<4 x i64> %a, <4 x i64> %b) {
;
; AVX512-LABEL: and_mulhuw_v4i16:
; AVX512: # %bb.0:
-; AVX512-NEXT: vpmulhuw %ymm1, %ymm0, %ymm0
-; AVX512-NEXT: vpmovqw %zmm0, %xmm0
+; AVX512-NEXT: # kill: def $ymm1 killed $ymm1 def $zmm1
+; AVX512-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT: vpmovqd %zmm0, %ymm0
+; AVX512-NEXT: vpmovqd %zmm1, %ymm1
+; AVX512-NEXT: vpmulhuw %xmm1, %xmm0, %xmm0
+; AVX512-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
----------------
Shoreshen wrote:
Hi @nikic, I think this happens because the result of the multiply is only used by a trunc. The original LLVM IR before instruction selection looks like this:
```
define <4 x i16> @and_mulhuw_v4i16(<4 x i64> %a, <4 x i64> %b) #0 {
%a1 = and <4 x i64> %a, splat (i64 65535)
%b1 = and <4 x i64> %b, splat (i64 65535)
%c = mul <4 x i64> %a1, %b1
%d = lshr <4 x i64> %c, splat (i64 16)
%e = trunc <4 x i64> %d to <4 x i16>
ret <4 x i16> %e
}
```
The updated IR after the narrowing looks like this:
```
define <4 x i16> @and_mulhuw_v4i16(<4 x i64> %a, <4 x i64> %b) #0 {
%a1 = and <4 x i64> %a, splat (i64 65535)
%b1 = and <4 x i64> %b, splat (i64 65535)
%1 = trunc <4 x i64> %a1 to <4 x i32>
%2 = trunc <4 x i64> %b1 to <4 x i32>
%3 = mul <4 x i32> %1, %2
%4 = zext <4 x i32> %3 to <4 x i64>
%d = lshr <4 x i64> %4, splat (i64 16)
%e = trunc <4 x i64> %d to <4 x i16>
ret <4 x i16> %e
}
```
The transform first truncates the two operands, does the multiply in 32 bits, and zero-extends the result, which is then truncated again by the final trunc.
Should I avoid narrowing when the result is eventually truncated to a lower bit width?
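If that is the right fix, one possible shape for such a bail-out is sketched below. This is not the code from this PR; the helper name is invented for illustration, and for the test above the check would additionally have to look through the `lshr` before reaching the `trunc`:
```cpp
// Hypothetical bail-out for the narrowing transform: skip the rewrite when
// every use of the wide result is a trunc to a type no wider than the
// narrowed width, since the zext produced by narrowing would only feed
// truncs anyway.
#include "llvm/IR/Instructions.h"

using namespace llvm;

static bool onlyUsedByNarrowTruncs(const Instruction &I, unsigned NarrowBits) {
  if (I.use_empty())
    return false;
  for (const User *U : I.users()) {
    const auto *Trunc = dyn_cast<TruncInst>(U);
    if (!Trunc || Trunc->getType()->getScalarSizeInBits() > NarrowBits)
      return false;
  }
  return true;
}
```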
https://github.com/llvm/llvm-project/pull/130577