[llvm] [NVPTX] Add idp2a, idp4a intrinsics (PR #102763)

Tue Aug 13 12:45:20 PDT 2024

================
@@ -287,6 +287,62 @@ The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence us
 
 The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
 
+Arithmetic Intrinsics
+---------------------
+
+'``llvm.nvvm.idp2a``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+    declare i32 @llvm.nvvm.idp2a(i32 %a, i1 immarg %a.unsigned, i32 %b, i1 immarg %b.unsigned, i1 immarg %is.hi, i32 %c)
+
+Overview:
+"""""""""
+
+The '``llvm.nvvm.idp2a``' intrinsic performs a 2-element vector dot product
+followed by addition. It corresponds directly to the ``dp2a`` PTX instruction.
+
+Semantics:
+""""""""""
+
+The 32-bit value in ``%a`` is broken into 2 16-bit values which are either sign
+or zero extended, depending on the value of ``%a.unsigned``, to 32 bits. Two
+bytes are selected from ``%b``, if ``%is.hi`` is true, the most significant
+bytes are selected, otherwise the least significant bytes are selected. These
+bytes are each sign or zero extended, depending on ``%b.unsigned``. The dot
+product of these 2-element vectors is added to ``%c`` to produce the return.
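
For readers following the review, the semantics paragraph above can be sketched as a scalar reference model in C. This is only an illustration, not code from the patch: the names `idp2a_ref`, `ext16`, and `ext8` are hypothetical, and the element pairing (low half of `a` with the lower selected byte of `b`) is an assumption taken from the `dp2a` pseudocode in the PTX ISA rather than from the text of this diff. Arithmetic is done on `uint32_t` so that wraparound is well defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: sign- or zero-extend the low 16 bits of v to 32 bits. */
static uint32_t ext16(uint32_t v, bool is_unsigned) {
    v &= 0xFFFFu;
    return (is_unsigned || !(v & 0x8000u)) ? v : (v | 0xFFFF0000u);
}

/* Hypothetical helper: sign- or zero-extend the low 8 bits of v to 32 bits. */
static uint32_t ext8(uint32_t v, bool is_unsigned) {
    v &= 0xFFu;
    return (is_unsigned || !(v & 0x80u)) ? v : (v | 0xFFFFFF00u);
}

/* Reference model (assumed pairing): a's two 16-bit halves are multiplied
 * with the two low bytes of b (is_hi == false) or the two high bytes of b
 * (is_hi == true), and the products are accumulated into c modulo 2^32. */
uint32_t idp2a_ref(uint32_t a, bool a_unsigned,
                   uint32_t b, bool b_unsigned,
                   bool is_hi, uint32_t c) {
    unsigned b_shift = is_hi ? 16u : 0u;
    uint32_t d = c;
    for (unsigned i = 0; i < 2; ++i) {
        uint32_t va = ext16(a >> (16u * i), a_unsigned);
        uint32_t vb = ext8(b >> (b_shift + 8u * i), b_unsigned);
        d += va * vb; /* low 32 bits match the signed product as well */
    }
    return d;
}
```

Under these assumptions, `idp2a_ref(0x00020003, true, 0x04050607, true, false, 10)` selects bytes `0x07` and `0x06` from `b` and computes `3*7 + 2*6 + 10`, while passing `true` for `is_hi` selects `0x05` and `0x04` instead.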
----------------
gchak wrote:

Yes, keeping consistency with this suggested design is what is making me think twice about this. My question was a general one: if we or others see a similar need with more than two operands, how would we design them? In this context, I am also thinking about the design beyond ease of implementation: does replicating a particular intrinsic beyond a certain number of variants seem reasonable for this representation of operand signedness, or not?

I also noticed another issue: sometimes we also need to express signedness on the return type. Unfortunately, I cannot share the particular intrinsic yet. The idp* intrinsics also have a return type, but we don't represent its signedness. However, for some other intrinsics we might need to represent it on the destination as well.

This makes me wonder if the earlier design by @AlexMaclean keeps things more explicit and clear.

Would love your thoughts here, but these factors are keeping me inclined towards the previous design. :)


https://github.com/llvm/llvm-project/pull/102763