[llvm] [NVPTX] Add idp2a, idp4a intrinsics (PR #102763)

Tue Aug 13 11:07:28 PDT 2024

================
@@ -287,6 +287,62 @@ The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence us
 
 The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
 
+Arithmetic Intrinsics
+---------------------
+
+'``llvm.nvvm.idp2a``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+    declare i32 @llvm.nvvm.idp2a(i32 %a, i1 immarg %a.unsigned, i32 %b, i1 immarg %b.unsigned, i1 immarg %is.hi, i32 %c)
+
+Overview:
+"""""""""
+
+The '``llvm.nvvm.idp2a``' intrinsic performs a 2-element vector dot product
+followed by addition. It corresponds directly to the ``dp2a`` PTX instruction.
+
+Semantics:
+""""""""""
+
+The 32-bit value in ``%a`` is broken into 2 16-bit values which are either sign
+or zero extended, depending on the value of ``%a.unsigned``, to 32 bits. Two
+bytes are selected from ``%b``, if ``%is.hi`` is true, the most significant
+bytes are selected, otherwise the least significant bytes are selected. These
+bytes are each sign or zero extended, depending on ``%b.unsigned``. The dot
+product of these 2-element vectors is added to ``%c`` to produce the return.
----------------
Artem-B wrote:

> However, when there are multiple operands with different signedness, does it scale to encode all combinations in the intrinsic name? 

It depends, I guess. Naming is hard, and good, consistent API design complicates things further. I do not think we have any standard set of rules on that. Yes, it's possible that with more type variants we may end up with too many intrinsics, or would have to resort to other options. If you have a specific example in mind, we could discuss how it could be handled.

> 4 operands and signedness, would you still recommend to encode all combinations in the name?

If it could all be offloaded to tablegen without complicating things, then my answer would probably be "yes". That does not mean it's the final word on the matter. Other LLVM developers may have different opinions.

If there is existing code in LLVM using a different approach, I would consider the design choices and the arguments raised during review of that code. 

You may also ask on LLVM's Discourse and see if the wider LLVM community would offer any suggestions.
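
For reference while discussing the operand encoding, here is a rough host-side sketch of the semantics described in the quoted hunk. It is not part of the patch: the name `idp2a_ref` and the helper functions are hypothetical, and the pairing of the low half of `a` with the lower of the two selected bytes of `b` is an assumption based on the PTX `dp2a` element ordering, which the quoted text does not spell out. Accumulation is done in unsigned 32-bit arithmetic so that wraparound is well defined.

```cpp
#include <cstdint>

// Hypothetical reference model of the documented idp2a semantics.
// Sign- or zero-extend the low 16 bits of v to 32 bits.
static uint32_t extend16(uint32_t v, bool is_unsigned) {
  uint16_t h = static_cast<uint16_t>(v);
  if (is_unsigned)
    return h;
  return static_cast<uint32_t>(static_cast<int32_t>(static_cast<int16_t>(h)));
}

// Sign- or zero-extend the low 8 bits of v to 32 bits.
static uint32_t extend8(uint32_t v, bool is_unsigned) {
  uint8_t byte = static_cast<uint8_t>(v);
  if (is_unsigned)
    return byte;
  return static_cast<uint32_t>(static_cast<int32_t>(static_cast<int8_t>(byte)));
}

uint32_t idp2a_ref(uint32_t a, bool a_unsigned, uint32_t b, bool b_unsigned,
                   bool is_hi, uint32_t c) {
  // Select the two most or least significant bytes of b.
  uint32_t b_sel = is_hi ? (b >> 16) : (b & 0xFFFFu);
  uint32_t acc = c;
  for (int i = 0; i < 2; ++i) {
    // Assumed pairing: element i of a with element i of the selected bytes.
    uint32_t ai = extend16(a >> (16 * i), a_unsigned);
    uint32_t bi = extend8(b_sel >> (8 * i), b_unsigned);
    acc += ai * bi; // the low 32 bits are the same for signed and unsigned
  }
  return acc;
}
```

The three `bool` parameters mirror the `immarg` flags in the proposed signature, so each flag combination corresponds to one name if the variants were instead encoded in the intrinsic name.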



https://github.com/llvm/llvm-project/pull/102763