[llvm] [Intrinsics][AArch64] Add intrinsic to mask off aliasing vector lanes (PR #117007)
Sander de Smalen via llvm-commits
llvm-commits at lists.llvm.org
Mon Feb 3 06:16:47 PST 2025
================
@@ -23624,6 +23624,92 @@ Examples:
%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %elem0, i64 429)
%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %3, i32 4, <4 x i1> %active.lane.mask, <4 x i32> poison)
+.. _int_experimental_get_noalias_lane_mask:
+
+'``llvm.experimental.get.noalias.lane.mask.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+ declare <4 x i1> @llvm.experimental.get.noalias.lane.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize, i1 immarg %writeAfterRead)
----------------
sdesmalen-arm wrote:
I had to dig a bit into the semantics of this intrinsic and the corresponding whilerw/whilewr instructions for AArch64 to get a better understanding. It seems there are 6 different cases to distinguish:
1) read-after-write, positive distance
2) write-after-read, positive distance
3) read-after-write, negative distance
4) write-after-read, negative distance
5) write-after-read, zero distance
6) read-after-write, zero distance
(1) and (2) would change the original order of read/write operations when vectorising with a sufficiently large VF, hence the reason for wanting a mask to know how many lanes are safe to vectorise.
(3) would not change the order of read/write operations, but may result in a performance penalty, because the write must have completed fully before being able to read.
(4), (5) and (6) are no problem for vectorization, because none of those would result in read/write operations being reordered when being vectorized.
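To make the reordering hazard concrete, here is a minimal Python sketch (not LLVM code). It assumes a hypothetical loop `buf[i + dist] = buf[i] + 1` in which the store runs `dist` elements ahead of the load, and models a vector step as "load all lanes, then store all lanes". The names `scalar_loop`, `vector_loop` and `safe_lanes` are made up for illustration, and which of the numbered cases this loop corresponds to depends on the naming convention used; the sketch only shows why the number of active lanes has to be limited.

```python
def scalar_loop(buf, n, dist):
    """Reference semantics: one load and one store per iteration, in order."""
    out = list(buf)
    for i in range(n):
        out[i + dist] = out[i] + 1
    return out


def vector_loop(buf, n, dist, vf, safe_lanes=None):
    """Vectorised model: each step loads `lanes` elements, then stores them.

    `safe_lanes` stands in for the number of leading lanes a no-alias mask
    would leave active (an assumption for this sketch, not the intrinsic).
    """
    out = list(buf)
    lanes = vf if safe_lanes is None else min(vf, safe_lanes)
    i = 0
    while i < n:
        width = min(lanes, n - i)
        loaded = [out[i + k] for k in range(width)]  # whole-vector load
        for k in range(width):                       # whole-vector store
            out[i + k + dist] = loaded[k] + 1
        i += width
    return out


data = [0] * 12
expected = scalar_loop(data, n=8, dist=2)
unmasked = vector_loop(data, n=8, dist=2, vf=4)               # loads stale values
masked = vector_loop(data, n=8, dist=2, vf=4, safe_lanes=2)   # matches scalar
```

With a distance of 2 and VF=4, the unmasked version loads elements that the scalar loop would already have overwritten, so its result differs; restricting each step to 2 lanes restores the scalar result.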
I can see that `%writeAfterRead` was added to handle case (3), but to me its semantics are not very intuitive. The parameter only makes sense in the context in which the intrinsic is used (i.e. whether it is used for a read-write or a write-read sequence); it does not describe a self-contained operation on a vector of values that returns a result vector.
Rather than encoding that with an extra immediate parameter, I think it's worth considering a slightly different intrinsic that returns an 'alias dependence vector' rather than a boolean mask. Each dependence value is either negative (for a negative distance), positive (for a positive distance) or `0` (which would mean either a dependence distance of `0` or no dependence). It's then possible to use an `icmp` instruction to generate a no-alias mask for the case you're interested in, i.e. `icmp sle ... zeroinitializer` for cases (1) and (2), or `icmp eq ... zeroinitializer` for case (3).
We could then also remove the `%elementSize` parameter and instead encode this information in the return type. For example:
```
%intra.vector.deps = call <4 x i32> @llvm.experimental.alias.vector.v4i32(ptr %a, ptr %b)
%mask = icmp eq <4 x i32> %intra.vector.deps, zeroinitializer
```
This would generate a no-alias mask for an element size of 4 (`i32`), and for AArch64 would map to a `whilerw` instruction.
I would suggest not prescribing what exactly the 'positive' and 'negative' values must be, only that they must describe the dependence direction and fit the element type (which must be wider than 1 bit in order to represent a positive, a negative and a zero value). If, for example, we say that each value must be the signed dependence distance, that might put unnecessary restrictions on the minimum element size for targets that have no architectural maximum VF.
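As a rough illustration of how such a dependence vector could be consumed, the Python sketch below models one possible encoding. Everything in it is an assumption made for the example: the helper names, the choice of `+1`/`-1` as the positive and negative values, the rule that the first `|distance|` lanes are dependence-free, and the requirement that the two addresses differ by a multiple of the element size. The proposal itself deliberately leaves the exact values unspecified.

```python
def sign(x):
    return (x > 0) - (x < 0)


def alias_vector(addr_a, addr_b, elem_size, vf):
    """Hypothetical model: lanes below |distance| get 0, the rest get the sign.

    Assumes addr_b - addr_a is a multiple of elem_size.
    """
    dist = (addr_b - addr_a) // elem_size  # dependence distance in elements
    return [0 if lane < abs(dist) else sign(dist) for lane in range(vf)]


def icmp_sle_zero(vec):
    """Model of `icmp sle <vec>, zeroinitializer`."""
    return [v <= 0 for v in vec]


def icmp_eq_zero(vec):
    """Model of `icmp eq <vec>, zeroinitializer`."""
    return [v == 0 for v in vec]


positive = alias_vector(0, 8, elem_size=4, vf=4)   # distance +2
negative = alias_vector(8, 0, elem_size=4, vf=4)   # distance -2
mask_pos = icmp_sle_zero(positive)   # only positive-distance lanes masked off
mask_neg = icmp_sle_zero(negative)   # negative distance keeps every lane
mask_strict = icmp_eq_zero(negative) # also masks negative-distance lanes
```

Under this encoding, the signed comparison keeps every lane for a negative distance, while the equality comparison also masks off lanes with a negative dependence, which is the distinction the extra immediate parameter was meant to express.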
https://github.com/llvm/llvm-project/pull/117007