[llvm] [NVPTX] Add idp2a, idp4a intrinsics (PR #102763)

Tue Aug 13 14:41:31 PDT 2024

================
@@ -287,6 +287,62 @@ The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence us
 
 The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
 
+Arithmetic Intrinsics
+---------------------
+
+'``llvm.nvvm.idp2a``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+    declare i32 @llvm.nvvm.idp2a(i32 %a, i1 immarg %a.unsigned, i32 %b, i1 immarg %b.unsigned, i1 immarg %is.hi, i32 %c)
+
+Overview:
+"""""""""
+
+The '``llvm.nvvm.idp2a``' intrinsic performs a 2-element vector dot product
+followed by addition. It corresponds directly to the ``dp2a`` PTX instruction.
+
+Semantics:
+""""""""""
+
+The 32-bit value in ``%a`` is broken into 2 16-bit values which are either sign
+or zero extended, depending on the value of ``%a.unsigned``, to 32 bits. Two
+bytes are selected from ``%b``, if ``%is.hi`` is true, the most significant
+bytes are selected, otherwise the least significant bytes are selected. These
+bytes are each sign or zero extended, depending on ``%b.unsigned``. The dot
+product of these 2-element vectors is added to ``%c`` to produce the return.
----------------
Artem-B wrote:

> if we or others see a similar need on > 2 operands how would we design them?

One way to think about this is that most intrinsics fall into two categories:
- they provide a well defined functionality (e.g. math functions)
- they map to specific instructions.

For the defined-functionality intrinsics, the functionality usually dictates the intrinsic signature, so there's not much wiggle room.

The per-instruction intrinsics are probably better served by a direct 1:1 intrinsic:instruction map, with the intrinsic name roughly matching the instruction syntax. It works OK for 'normal' targets, but PTX is... unusual.

A 1:1 mapping is usually easily handled by tablegen, and LLVM can deal with a relatively large number of variants. LLVM already has ~11K intrinsics, and we already have multi-parameter variants for the various *MMA instructions, which come in many forms to accommodate different geometries and types. So, if you have only a handful of variables, a 1:1 map is fine.

If there's a good reason to push instruction variant selection into the intrinsic's arguments for the sake of simplifying things for the user, it should be discussed on a case-by-case basis. Often it does not provide as much benefit as it may appear to.

In cases like the `dp*` instructions here we end up providing N extra immediate-only arguments, and it does not really save us anything -- instead of explicitly picking among N intrinsic names directly mapped to an instruction, we'd be picking a single intrinsic, and then have to choose from N instruction variants anyway. That just adds yet another indirection layer for the end user to be aware of.
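
To make the trade-off concrete, here is a rough Python sketch, not anything from the patch. It assumes the `dp2a.{lo,hi}.{atype}.{btype}` mnemonic form from the PTX ISA, and the per-variant intrinsic names it prints are hypothetical, shown only to illustrate what a 1:1 naming scheme might look like. With either design, the same three boolean choices have to be resolved to one of 8 instruction variants:

```python
from itertools import product


def dp2a_mnemonic(a_unsigned: bool, b_unsigned: bool, is_hi: bool) -> str:
    """Map the three immediate flags of the single-intrinsic design
    to the PTX instruction variant they select."""
    mode = "hi" if is_hi else "lo"
    atype = "u32" if a_unsigned else "s32"
    btype = "u32" if b_unsigned else "s32"
    return f"dp2a.{mode}.{atype}.{btype}"


def per_variant_intrinsic_name(a_unsigned: bool, b_unsigned: bool, is_hi: bool) -> str:
    """Hypothetical 1:1 naming: the intrinsic name mirrors the mnemonic.
    These names are an assumption for illustration, not existing intrinsics."""
    return "llvm.nvvm." + dp2a_mnemonic(a_unsigned, b_unsigned, is_hi)


# Enumerate every (a_unsigned, b_unsigned, is_hi) combination.
variants = {
    flags: dp2a_mnemonic(*flags)
    for flags in product([False, True], repeat=3)
}

if __name__ == "__main__":
    for flags, mnemonic in variants.items():
        print(flags, "->", mnemonic, "|", per_variant_intrinsic_name(*flags))
```

In the flag-based design the user has to know both the flag order and the resulting variant; in the 1:1 design the name alone identifies the instruction.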

https://github.com/llvm/llvm-project/pull/102763