[all-commits] [llvm/llvm-project] 264d11: AMDGPU/GISel: Introduce custom legalization of G_MUL
Nicolai Hähnle via All-commits
all-commits at lists.llvm.org
Thu Jun 9 04:39:27 PDT 2022
Branch: refs/heads/main
Home: https://github.com/llvm/llvm-project
Commit: 264d1136f9ae0b80b59b174bbe877d680c33143b
https://github.com/llvm/llvm-project/commit/264d1136f9ae0b80b59b174bbe877d680c33143b
Author: Nicolai Hähnle <nicolai.haehnle at amd.com>
Date: 2022-06-09 (Thu, 09 Jun 2022)
Changed paths:
M llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
M llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h
M llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-mul.mir
M llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-sdiv.mir
M llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-srem.mir
M llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-udiv.mir
M llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-umulo.mir
M llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-urem.mir
M llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
M llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll
M llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll
M llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll
M llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
Log Message:
-----------
AMDGPU/GISel: Introduce custom legalization of G_MUL
The generic legalizer framework is still used to reduce the problem
to scalar multiplication with the bit size a multiple of 32.
Generating optimal code sequences for big integer multiplication is
somewhat tricky and has a number of target-specific intricacies:
- The target has V_MAD_U64_U32 instructions that multiply two 32-bit
factors and add a 64-bit accumulator. Most partial products should
use this instruction.
- The accumulator is mapped to consecutive 32-bit GPRs, and partial-
product multiply-adds can feed the accumulator into each other
directly. (The register allocator's support for that is somewhat
limited, but that only matters for 128-bit integers and larger.)
- OTOH, on some hardware, V_MAD_U64_U32 requires the accumulator
to be stored in an even-aligned pair of GPRs. To avoid excessive
register copies, it makes sense to compute odd partial products
separately from even partial products (where a partial product
src0[j0] * src1[j1] is "odd" if j0 + j1 is odd) and add both
halves together as a final step.
- We can combine G_MUL+G_ADD into a single cascade of multiply-adds.
- The target can keep many carry-bits in flight simultaneously, so
combining carries using G_UADDE is preferable over G_ZEXT + G_ADD.
- Not addressed by this patch: When the factors are sign-extended,
the V_MAD_I64_I32 instruction (signed version!) can be used.
It is difficult to address these points generically:
1) Finding matching pairs of G_MUL and G_UMULH to find a wide
multiply is expensive. We could add a G_UMUL_LOHI generic instruction
and conditionally use that in the generic legalizer, but by itself
this wouldn't allow us to use the accumulation capability of
V_MAD_U64_U32. One could attempt to find matching G_ADD + G_UADDE
post-legalization, but this is also expensive.
2) Similarly, making sense of the legalization outcome of a wide
pre-legalization G_MUL+G_ADD pair is extremely expensive.
3) How could the generic legalizer possibly deal with the
particular idiosyncracy of "odd" vs. "even" partial products.
All this points in the direction of directly emitting an ideal code
sequence during legalization, but the generic legalizer should not
be burdened with such overly target-specific concerns. Hence, a
custom legalization.
Note that the implemented approach is different from that used by
SelectionDAG because narrowing of scalars works differently in
general. SelectionDAG iteratively cuts wide scalars into low and
high halves until a legal size is reached. By contrast, GlobalISel
does the narrowing in a single shot, which should be better for
compile-time and for the quality of the generated code.
This patch leaves three gaps open:
1. When the factors are uniform, we should execute the multiplication on
the SALU. Register bank mapping already ensures this.
However, the resulting code sequence is not optimal because it doesn't
fully use the carry-in capabilities of S_ADDC_U32. (V_MAD_U64_U32
doesn't have a carry-in.) It is very difficult to fix this after the
fact, so we should really use a different legalization sequence in
this case. Unfortunately, we don't have a divergence analysis and so
cannot make that choice.
(This only matters for 128-bit integers and larger.)
2. Avoid unnecessary multiplies when sources are known to be zero- or
sign-extended. The challenge is that the legalizer does not currently
have access to GISelKnownBits.
3. When the G_MUL is followed by a G_ADD, we should consider combining
the two instructions into a single multiply-add sequence, to utilize
the accumulator of V_MAD_U64_U32 fully. (Unless the multiply has
multiple uses and the implied duplication of the multiply is an
overall negative). However, this is also not true when the factors
are uniform: in that case, it is generally better to *not* combine
the two operations, so that the multiply can be done on the SALU.
Again, we don't have a divergence analysis available and so cannot
make an informed choice.
Differential Revision: https://reviews.llvm.org/D124844
More information about the All-commits
mailing list