[LLVMdev] Adding masked vector load and store intrinsics

This looks like a reasonable proposal. However, might the native instructions that support such masked loads/stores have high latency? Also, it would be good to name some workloads where this would have a positive impact.


We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target whether masked vector loads and stores are available. The SLP vectorizer could potentially be enhanced to use these intrinsics as well.
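
As an illustration of the kind of loop meant here, consider the C sketch below. The function name, the array names and the condition are hypothetical and not taken from any particular workload. Both the load from in[i] and the store to out[i] are guarded by the condition, so a vectorizer cannot simply execute them unconditionally for every lane: in[i] might not be safely readable, and out[i] must stay unmodified when the condition is false.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loop with conditional memory accesses: in[i] is read and
 * out[i] is written only for iterations where trigger[i] is positive. */
void conditional_update(int32_t *out, const int32_t *in,
                        const int32_t *trigger, size_t n)
{
    assert(out != NULL && in != NULL && trigger != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (trigger[i] > 0)
            out[i] = in[i] + 1;
    }
}
```

In a vectorized form of such a loop, the comparison on trigger would presumably produce the mask, and the guarded load and store would become the masked operations.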

The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.
The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off, no address will be accessed.

  call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask)

  %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask)

where %passthru supplies the elements of %data for masked-off lanes, if any; it can be zeroinitializer or undef.
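
The intended per-lane behaviour can be sketched as a scalar reference model in C, roughly what a scalarized expansion would have to compute. This is only a sketch under assumptions: the function names are hypothetical, the mask is modelled as an array of bool, the element type is fixed to 32-bit integers, and the alignment operand is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked store: lane i is written only if mask[i] is set.
 * addr is never dereferenced for a masked-off lane. */
void masked_store_ref(int32_t *addr, const int32_t *data,
                      const bool *mask, size_t lanes)
{
    assert(data != NULL && mask != NULL);
    for (size_t i = 0; i < lanes; ++i) {
        if (mask[i])
            addr[i] = data[i];
    }
}

/* Scalar model of a masked load: enabled lanes are read from memory,
 * masked-off lanes take the corresponding passthru element. */
void masked_load_ref(int32_t *result, const int32_t *addr,
                     const int32_t *passthru, const bool *mask, size_t lanes)
{
    assert(result != NULL && passthru != NULL && mask != NULL);
    for (size_t i = 0; i < lanes; ++i)
        result[i] = mask[i] ? addr[i] : passthru[i];
}
```

In this model, an all-false mask means addr is never read or written, so even an invalid address is harmless, which matches the rule that no address is accessed when all lanes are masked off.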

Comments so far, before we dive into more details?

Thank you.

- Elena and Ayal

