[LLVMdev] Adding masked vector load and store intrinsics

Fri Oct 24 12:50:39 PDT 2014

"Das, Dibyendu" <Dibyendu.Das at amd.com> writes:

> Is there an example of such a workload (let's say from the SPEC CPU
> 2006 harness or similar) that you have in mind and the amount of gain
> expected?

Nearly every code that has significant vector work in it.  Even if
there is no control flow in the loop, masking allows the compiler to
vectorize more aggressively and rely on the masks to prevent unsafe
execution (for example, touching memory past the end of an array).
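
To illustrate the no-control-flow case, here is a minimal sketch in
plain C.  It emulates masked loads and stores with scalar lane loops
and is not the proposed intrinsics themselves.  The names VF,
masked_load, masked_store and add_masked are hypothetical, chosen only
for this example, and the vector width of 4 is an assumption.  The
loop body has no branches, yet the trip count need not be a multiple
of the vector width: the final block runs with a partial mask instead
of falling back to a scalar remainder loop.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* hypothetical vector width in lanes (assumption) */

/* Emulates a masked vector load: inactive lanes are never dereferenced
   and receive the pass-through value instead. */
static void masked_load(float dst[VF], const float *src,
                        const int mask[VF], float passthru)
{
    for (size_t lane = 0; lane < VF; ++lane)
        dst[lane] = mask[lane] ? src[lane] : passthru;
}

/* Emulates a masked vector store: inactive lanes leave memory untouched. */
static void masked_store(float *dst, const float src[VF], const int mask[VF])
{
    for (size_t lane = 0; lane < VF; ++lane)
        if (mask[lane])
            dst[lane] = src[lane];
}

/* a[i] = b[i] + c[i] for i in [0, n), processed VF lanes at a time.
   There is no scalar remainder loop: the last block runs with a
   partial mask, so nothing at index n or beyond is read or written. */
void add_masked(float *a, const float *b, const float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i += VF) {
        int mask[VF];
        float va[VF], vb[VF], vc[VF];

        /* Lane is active only if its element index is in bounds. */
        for (size_t lane = 0; lane < VF; ++lane)
            mask[lane] = (i + lane) < n;

        masked_load(vb, b + i, mask, 0.0f);
        masked_load(vc, c + i, mask, 0.0f);
        for (size_t lane = 0; lane < VF; ++lane)
            va[lane] = vb[lane] + vc[lane];
        masked_store(a + i, va, mask);
    }
}
```

With n = 6 the second block has only its first two lanes active, so
elements 6 and 7 of the arrays are never touched.  On real hardware
each lane loop would be a single masked vector instruction.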

The amount of gain is highly code-dependent but my guess is that Elena's
example of 2x speedup is typical, maybe even on the lower end.

The capability of the vectorizer is the biggest factor.  Without masks,
the vectorizer cannot be as aggressive.  With masks, the vectorizer
still has to be written to be aggressive.  Ph.D. dissertations have been
written on the topic.  It's non-trivial work.

Masking is an enabling technology, not an end goal.

                         -David