[llvm] [NVPTX] Add idp2a, idp4a intrinsics (PR #102763)

Mon Aug 12 14:11:51 PDT 2024

================
@@ -287,6 +287,62 @@ The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence us
 
 The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
 
+Arithmetic Intrinsics
+---------------------
+
+'``llvm.nvvm.idp2a``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+    declare i32 @llvm.nvvm.idp2a(i32 %a, i1 immarg %a.unsigned, i32 %b, i1 immarg %b.unsigned, i1 immarg %is.hi, i32 %c)
+
+Overview:
+"""""""""
+
+The '``llvm.nvvm.idp2a``' intrinsic performs a 2-element vector dot product
+followed by addition. It corresponds directly to the ``dp2a`` PTX instruction.
+
+Semantics:
+""""""""""
+
+The 32-bit value in ``%a`` is broken into 2 16-bit values which are either sign
+or zero extended, depending on the value of ``%a.unsigned``, to 32 bits. Two
+bytes are selected from ``%b``, if ``%is.hi`` is true, the most significant
+bytes are selected, otherwise the least significant bytes are selected. These
+bytes are each sign or zero extended, depending on ``%b.unsigned``. The dot
+product of these 2-element vectors is added to ``%c`` to produce the return.
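
For readers following the review, the semantics in the hunk above can be sketched as a small reference model. This is only an illustration and not part of the patch: the names `idp2a_model` and `_extend` are hypothetical, and the model assumes that the low 16-bit half of ``%a`` pairs with the lower of the two selected bytes of ``%b`` and that the result wraps to 32 bits, neither of which the quoted text states explicitly.

```python
def _extend(value, bits, is_unsigned):
    """Zero- or sign-extend the low `bits` bits of `value` to a Python int."""
    value &= (1 << bits) - 1
    if not is_unsigned and value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def idp2a_model(a, a_unsigned, b, b_unsigned, is_hi, c):
    """Hypothetical model of the described semantics; returns a 32-bit pattern."""
    # Split %a into its two 16-bit halves, low half first.
    a_elems = [_extend(a >> shift, 16, a_unsigned) for shift in (0, 16)]
    # Select the two most or least significant bytes of %b.
    base = 16 if is_hi else 0
    b_elems = [_extend(b >> (base + shift), 8, b_unsigned) for shift in (0, 8)]
    # Assumption: elements are paired in order of significance, and the
    # accumulated result wraps to 32 bits.
    dot = sum(x * y for x, y in zip(a_elems, b_elems))
    return (c + dot) & 0xFFFFFFFF
```

With ``%a = 0x00020003`` and ``%b = 0x04050607``, both unsigned, the low selection multiplies 3 by 0x07 and 2 by 0x06, while the high selection uses 0x05 and 0x04 instead.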
----------------
Artem-B wrote:

It's indeed largely a naming problem in this case.

> this is what has been implemented internally and switching over will be somewhat painful.

The existence of a non-public downstream implementation is a relatively weak argument for what the right thing to do in LLVM itself is. Since non-public changes did not go through the normal LLVM review process, some amount of change is inevitable when they finally face public review. This is business as usual, IMO, and the maintenance price one pays for a diverged fork.

You may have existing users of your internal APIs for whom a change would be cumbersome. For that, the intrinsic auto-upgrade process may be helpful: it could be used to automatically translate calls to the internal intrinsics into LLVM's.

https://github.com/llvm/llvm-project/blob/b368404dee8c341dc022a9e9a868f5a268e92033/llvm/lib/IR/AutoUpgrade.cpp#L10


https://github.com/llvm/llvm-project/pull/102763