[llvm] [ISel] Introduce llvm.clmul intrinsic (PR #168731)

Thu Nov 20 06:14:23 PST 2025

================
@@ -18389,6 +18387,153 @@ Example:
       %r = call i8 @llvm.fshr.i8(i8 15, i8 15, i8 11)  ; %r = i8: 225 (0b11100001)
       %r = call i8 @llvm.fshr.i8(i8 0, i8 255, i8 8)   ; %r = i8: 255 (0b11111111)
 
+.. _int_clmul:
+
+'``llvm.clmul.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+This is an overloaded intrinsic. You can use ``llvm.clmul`` on any integer
+or vectors of integer elements.
+
+::
+
+      declare i16 @llvm.clmul.i16(i16 %a, i16 %b)
+      declare i32 @llvm.clmul.i32(i32 %a, i32 %b)
+      declare i64 @llvm.clmul.i64(i64 %a, i64 %b)
+      declare <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
+
+Overview:
+"""""""""
+
+The '``llvm.clmul``' family of intrinsic functions performs carry-less
+multiplication, or XOR multiplication, on the two arguments, and returns
+the low-bits.
+
+Arguments:
+""""""""""
+
+The arguments may be any integer type or vector of integer type. Both arguments
+and result must have the same type.
+
+Semantics:
+""""""""""
+
+The '``llvm.clmul``' intrinsic computes carry-less multiply of its arguments,
+which is the result of applying the standard Euclidean multiplication algorithm,
+where all of the additions are replaced with XORs, and returns the low-bits.
+The vector variants operate lane-wise.
+
+Example:
+""""""""
+
+.. code-block:: llvm
+
+      %r = call i4 @llvm.clmul.i4(i4 1, i4 2)    ; %r = 2
+      %r = call i4 @llvm.clmul.i4(i4 5, i4 6)    ; %r = 14
+      %r = call i4 @llvm.clmul.i4(i4 -4, i4 2)   ; %r = -8
+      %r = call i4 @llvm.clmul.i4(i4 -4, i4 -5)  ; %r = 4
+
+'``llvm.clmulr.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+This is an overloaded intrinsic. You can use ``llvm.clmulr`` on any integer
+or vectors of integer elements.
+
+::
+
+      declare i16 @llvm.clmulr.i16(i16 %a, i16 %b)
+      declare i32 @llvm.clmulr.i32(i32 %a, i32 %b)
+      declare i64 @llvm.clmulr.i64(i64 %a, i64 %b)
+      declare <4 x i32> @llvm.clmulr.v4i32(<4 x i32> %a, <4 x i32> %b)
+
+Overview:
+"""""""""
+
+The '``llvm.clmulr``' family of intrinsic functions performs reversed
+carry-less multiplication on the two arguments.
+
+Arguments:
+""""""""""
+
+The arguments may be any integer type or vector of integer type. Both arguments
+and result must have the same type.
+
+Semantics:
+""""""""""
+
+The '``llvm.clmulr``' intrinsic computes reversed carry-less multiply of its
+arguments. The vector variants operate lane-wise.
+
+.. code-block:: text
+
+      clmulr(%a, %b) = bitreverse(clmul(bitreverse(%a), bitreverse(%b)))
----------------
pfusik wrote:

> IMO, we should probably only have `clmul`. LLVM doesn't have `mulh` (multiply and return the high bits), and it seems reasonable that backends can implement codegen to match the reverses to transform any of the clmul* variants into whatever the hardware requires. The reason that we need intrinsics at all for things like clmul is that matching loops is really complicated, and there are countless ways to represent the loops over bits that implement a clmul (or a regular multiplication for that matter), but the representations of `clmulh` and `clmulr` are relatively restricted.

I agree we should only have `clmul`. The other two should be selected for RISC-V by pattern matching.

    clmulh(X, Y) == clmulr(X, Y) >> 1 == clmul(zext(X), zext(Y)) >> BW
    clmulr(X, Y) == clmul(zext(X), zext(Y)) >> (BW-1) == bitreverse(clmul(bitreverse(X), bitreverse(Y)))
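
The identities above can be checked exhaustively for small bit widths. Below is a minimal Python sketch, not taken from the PR: the helper names (`clmul_wide`, `check_identities`, etc.) are hypothetical, and it assumes that `>>` is a logical shift, that `zext` widens to at least `2*BW` bits, and that each result is truncated back to `BW` bits.

```python
def clmul_wide(a: int, b: int) -> int:
    """Full carry-less product of two non-negative ints (no truncation)."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift  # XOR replaces the addition of partial products
        b >>= 1
        shift += 1
    return result


def clmul(a: int, b: int, bw: int) -> int:
    """Low bw bits of the carry-less product (models llvm.clmul)."""
    mask = (1 << bw) - 1
    return clmul_wide(a & mask, b & mask) & mask


def clmulh(a: int, b: int, bw: int) -> int:
    """Bits [bw, 2*bw) of the widened product: clmul(zext, zext) >> BW."""
    mask = (1 << bw) - 1
    return (clmul_wide(a & mask, b & mask) >> bw) & mask


def clmulr(a: int, b: int, bw: int) -> int:
    """Bits [bw-1, 2*bw-1) of the widened product: clmul(zext, zext) >> (BW-1)."""
    mask = (1 << bw) - 1
    return (clmul_wide(a & mask, b & mask) >> (bw - 1)) & mask


def bitreverse(x: int, bw: int) -> int:
    """Reverse the order of the low bw bits of x."""
    r = 0
    for i in range(bw):
        r = (r << 1) | ((x >> i) & 1)
    return r


def check_identities(bw: int) -> bool:
    """Exhaustively test both identities for every pair of bw-bit inputs."""
    for a in range(1 << bw):
        for b in range(1 << bw):
            r = clmulr(a, b, bw)
            if clmulh(a, b, bw) != r >> 1:
                return False
            rev = bitreverse(clmul(bitreverse(a, bw), bitreverse(b, bw), bw), bw)
            if r != rev:
                return False
    return True


if __name__ == "__main__":
    for width in (1, 2, 4, 6):
        print(width, check_identities(width))
```

The check relies on the widened product of two `BW`-bit values having at most `2*BW - 1` significant bits, so the top bit of `clmulh` is always zero and `clmulr >> 1` loses nothing.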

https://github.com/llvm/llvm-project/pull/168731