[llvm] [AMDGPU] Enable vectorization of i8 values. (PR #134934)

Jeffrey Byrnes via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 18 10:00:47 PDT 2025


================
@@ -126,24 +126,24 @@ define amdgpu_kernel void @add_i16() #0 {
 define amdgpu_kernel void @add_i8() #0 {
 ; ALL-LABEL: 'add_i8'
 ; ALL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %i8 = add i8 undef, undef
-; ALL-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v2i8 = add <2 x i8> undef, undef
-; ALL-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v3i8 = add <3 x i8> undef, undef
-; ALL-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %v4i8 = add <4 x i8> undef, undef
-; ALL-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v5i8 = add <5 x i8> undef, undef
-; ALL-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v6i8 = add <6 x i8> undef, undef
-; ALL-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %v32i8 = add <32 x i8> undef, undef
-; ALL-NEXT:  Cost Model: Found an estimated cost of 66 for instruction: %v33i8 = add <33 x i8> undef, undef
+; ALL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v2i8 = add <2 x i8> undef, undef
+; ALL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v3i8 = add <3 x i8> undef, undef
+; ALL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v4i8 = add <4 x i8> undef, undef
+; ALL-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v5i8 = add <5 x i8> undef, undef
+; ALL-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v6i8 = add <6 x i8> undef, undef
+; ALL-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v32i8 = add <32 x i8> undef, undef
----------------
jrbyrnes wrote:

The cost of the v32i8 add should be 32.

https://godbolt.org/z/o5W1Y1oer

In the ISA, we don't have an instruction to do vectorized i8 adds. In the example, we see that the v32i8 vector add is scalarized into a bunch of individual i8 adds. Given that we need 32 instructions, the cost of the v32i8 add should be 32.
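As a concrete illustration (a minimal standalone sketch, not the test case from the patch), IR along these lines ends up as 32 separate byte adds on AMDGPU, since there is no packed i8 add:

  ; <32 x i8> add: with no packed i8 add in the ISA, the backend
  ; scalarizes this into 32 individual i8 adds.
  define <32 x i8> @add_v32i8(<32 x i8> %a, <32 x i8> %b) {
    %sum = add <32 x i8> %a, %b
    ret <32 x i8> %sum
  }

The cost-model printer should report the estimated cost for it directly, e.g. (flags as I'd expect from the in-tree CostModel tests): opt -mtriple=amdgcn-- -passes='print<cost-model>' -disable-output add_v32i8.ll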

The result of changing the cost to 8 is that the vector version has a cost savings of 24 over the scalar version. Thus, for a vectorizable chain, the vectorized version can degrade relative to the scalar version by a cost of up to 23 and still be chosen. Since the cost savings of 24 doesn't actually translate to improved assembly, this means we will allow code degradations for no benefit.
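Spelling out the arithmetic (my reading, taking each scalar i8 add as cost 1):

  scalar chain:    32 adds x 1   = 32
  vector form:     proposed cost =  8
  modeled saving:  32 - 8        = 24

so up to 23 units of extra cost elsewhere in the vectorized chain would still look profitable on paper, even though the backend emits the same 32 scalar adds either way.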

https://github.com/llvm/llvm-project/pull/134934

