[LLVMdev] Adding masked vector load and store intrinsics

Demikhovsky, Elena elena.demikhovsky at intel.com
Fri Oct 24 06:36:05 PDT 2014


I wrote a loop with conditional loads and stores and measured its performance on AVX2, where masking support is very basic relative to AVX-512.
I got a 2x speedup with vpmaskmovd.

The maskmov instruction is slower than a plain vector load or store, but much faster than eight scalar memory operations and eight branches.
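For reference, the vectorized loop body looks roughly like this, using the intrinsic signatures from the proposal quoted below (a sketch only; the value names and the exact loop are illustrative, not the benchmark itself). On AVX2, both calls can lower to vpmaskmovd:

  ; vectorized body of: if (trigger[i] > 0) A[i] = A[i] + 1
  %trig = load <8 x i32>* %trigger.vec, align 4
  %mask = icmp sgt <8 x i32> %trig, zeroinitializer
  %old  = call <8 x i32> @llvm.masked.load (i32* %A.i, <8 x i32> undef, i32 4, <8 x i1> %mask)
  %new  = add <8 x i32> %old, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  call void @llvm.masked.store (i32* %A.i, <8 x i32> %new, i32 4, <8 x i1> %mask)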

Masked instructions on AVX-512 will give an even bigger win: on that target, a masked memory operation has no additional latency compared to a regular vector memop.

- Elena

From: Das, Dibyendu [mailto:Dibyendu.Das at amd.com]
Sent: Friday, October 24, 2014 16:20
To: Demikhovsky, Elena; 'llvmdev at cs.uiuc.edu'
Cc: 'dag at cray.com'
Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics

This looks to be a reasonable proposal. However, might the native instructions that support such masked loads/stores have high latency? Also, it would be good to state some workloads where this will have a positive impact.

-dibyendu

From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
Sent: Friday, October 24, 2014 06:24 AM Central Standard Time
To: llvmdev at cs.uiuc.edu
Cc: dag at cray.com
Subject: [LLVMdev] Adding masked vector load and store intrinsics

Hi,

We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
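As an example of the kind of loop in question, here is a scalar body with a conditional load and store (a sketch; the names are hypothetical) that the vectorizer would turn into the masked intrinsics shown below:

  ; scalar body of: for (i = 0; i < n; ++i) if (trigger[i] > 0) out[i] = in[i];
for.body:
  %t   = load i32* %trigger.i, align 4
  %cmp = icmp sgt i32 %t, 0
  br i1 %cmp, label %if.then, label %for.inc
if.then:
  %v = load i32* %in.i, align 4
  store i32 %v, i32* %out.i, align 4
  br label %for.inc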

The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.
The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off, no address will be accessed.
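For such targets, the scalarized form might look roughly like the following per-lane expansion, shown here for lane 0 of a <2 x i32> masked load (a sketch only, not necessarily what the legalizer will emit; %passthru supplies the values for masked-off lanes, as described below):

entry:
  %m0 = extractelement <2 x i1> %mask, i32 0
  br i1 %m0, label %cond.load, label %else
cond.load:
  %v0   = load i32* %addr, align 4
  %vec0 = insertelement <2 x i32> %passthru, i32 %v0, i32 0
  br label %else
else:
  %res0 = phi <2 x i32> [ %vec0, %cond.load ], [ %passthru, %entry ]
  ; lane 1 is expanded the same way, starting from %res0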

  call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask)

  %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask)

where %passthru is used to fill the masked-off elements of %data (if any); it can be, for example, zeroinitializer or undef.
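Put differently, in cases where accessing all lanes is known to be safe, the result is equivalent to a full vector load combined with a per-lane select (an illustrative equivalence, not a suggested lowering):

  %full = load <8 x i32>* %addr, align 4   ; only valid if no masked-off lane can fault
  %res  = select <8 x i1> %mask, <8 x i32> %full, <8 x i32> %passthru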

Comments so far, before we dive into more details?

Thank you.

- Elena and Ayal



---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
---------------------------------------------------------------------