[llvm-dev] RFC: New intrinsics masked.expandload and masked.compressstore

Mon Sep 26 12:55:15 PDT 2016

----- Original Message -----

> From: "Michael Kuperstein" <mkuper at google.com>
> To: "Elena Demikhovsky" <elena.demikhovsky at intel.com>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "Ayal Zaks"
> <ayal.zaks at intel.com>, "Adam Nemet (anemet at apple.com)"
> <anemet at apple.com>, "Sanjay Patel (spatel at rotateright.com)"
> <spatel at rotateright.com>, "Nadav Rotem" <nadav.rotem at me.com>,
> "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Monday, September 26, 2016 2:31:41 AM
> Subject: Re: RFC: New intrinsics masked.expandload and
> masked.compressstore

> In theory, we could offload several things to such a target plug-in,
> I'm just not entirely sure we want to.

> Two examples I can think of:

> 1) This could be a better interface for masked load/stores and
> gathers.

> 2) Horizontal reductions. I tried writing
> yet-another-horizontals-as-first-class-citizens proposal a couple of
> months ago, and the main problem from the previous discussions about
> this was that there's no good common representation. E.g. should a
> horizontal add return a vector or a scalar, should it return the
> base type of the vector (assumes saturation) or a wider integer
> type, etc. With a plugin, we could have the vectorizer emit the
> right target intrinsic, instead of the crazy backend
> pattern-matching we have now.
I don't think we want to offload either of these things to the targets to produce target-specific intrinsics - both are fairly generic. There's value in using IR and then pattern-matching the result later because it also means that we pick up cases where the same pattern comes from people using C-level vector intrinsics, other portable frontends, etc. We don't want every frontend wishing to emit a horizontal reduction to need to use target-specific intrinsics for different targets. Our vectorizer should not be special in this regard. 
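
For illustration, here is a minimal sketch of the kind of generic pattern meant above: a horizontal add written only as "shuffle the upper half down, then add" steps. It is C++ that models the vector operations lane by lane, not actual IR. The name horizontalAdd, the 8-lane width, and the choice to return the lane type (wrapping on overflow) are assumptions made for the example, not something settled in this thread.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums the lanes of an 8-lane vector using log2(8) = 3
// "shuffle + add" steps, then reads lane 0. Unsigned lanes are used so that
// overflow wraps (the result has the lane type, which is one of the open
// design choices mentioned in the thread).
std::uint32_t horizontalAdd(std::array<std::uint32_t, 8> v) {
  for (std::size_t half = v.size() / 2; half >= 1; half /= 2) {
    std::array<std::uint32_t, 8> shuffled{};  // lanes >= half are don't-care
    for (std::size_t i = 0; i < half; ++i)
      shuffled[i] = v[i + half];              // models a shuffle of the upper half
    for (std::size_t i = 0; i < half; ++i)
      v[i] += shuffled[i];                    // models a vector add
  }
  return v[0];                                // models extracting element 0
}
```

Any frontend can produce this shape without knowing the target, which is the property argued for above; a backend that has a native reduction would have to recognize the whole sequence.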

However, this does bring up another issue with our current cost model: it currently estimates costs one instruction at a time, and so can't take advantage of lower costs associated with target instructions that have complicated behaviors (FMAs, saturating arithmetic, byte-swapping loads, etc.). This is a separate problem, in a sense, but perhaps there's a common solution.
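
A toy sketch of the difference, assuming made-up unit costs and a hypothetical Op enum (none of this is the real cost-model interface): the per-instruction estimate charges an fmul and an fadd separately, while a pattern-aware estimate charges the pair as one fused operation.

```cpp
#include <cstddef>
#include <vector>

enum class Op { Load, FMul, FAdd, Store };

// Made-up cost: every operation costs 1. Real targets differ.
int unitCost(Op) { return 1; }

// Per-instruction estimate: the sum of the individual costs.
int costPerInstruction(const std::vector<Op> &seq) {
  int total = 0;
  for (Op op : seq) total += unitCost(op);
  return total;
}

// Pattern-aware estimate: an FMul immediately followed by an FAdd is charged
// as a single fused operation of assumed cost 1. Adjacency stands in for the
// data dependence a real matcher would have to check.
int costWithFusion(const std::vector<Op> &seq) {
  int total = 0;
  for (std::size_t i = 0; i < seq.size(); ++i) {
    if (seq[i] == Op::FMul && i + 1 < seq.size() && seq[i + 1] == Op::FAdd) {
      total += 1;  // one FMA instead of two instructions
      ++i;         // skip the FAdd that was folded in
      continue;
    }
    total += unitCost(seq[i]);
  }
  return total;
}
```

For the sequence load, fmul, fadd, store, the first function reports 4 and the second reports 3; the gap is what a one-instruction-at-a-time model cannot see.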

-Hal 

> On Sun, Sep 25, 2016 at 9:28 PM, Demikhovsky, Elena <
> elena.demikhovsky at intel.com > wrote:

> > |
> 
> > |Hi Elena,
> 
> > |
> 
> > |Technically speaking, this seems straightforward.
> 
> > |
> 
> > |I wonder, however, how target-independent this is in a practical
> > |sense; will there be an efficient lowering when targeting any other
> > |ISA? I don't want to get into the territory where, because the
> > |vectorizer is supposed to be architecture independent, we need to
> > |add target-independent intrinsics for all potentially-side-effect-
> > |carrying idioms (or just complicated idioms) we want the vectorizer
> > |to support on any target. Is there a way we can design the vectorizer
> > |so that the targets can plug in their own idiom recognition for these
> > |kinds of things, and then, via that interface, let the vectorizer
> > |produce the relevant target-dependent intrinsics?
> 

> > Introducing a target-specific plug-in into the vectorizer may be a
> > good idea. We would need target-specific pattern recognition and a
> > target-specific implementation of “vectorizeMemoryInstruction”.
> > (More functionality may be needed in the future.)
> 
> > TTI->checkAdditionalVectorizationOpportunities() - detects
> > target-specific patterns; X86 will find compress/expand and maybe
> > others
> 
> > TTI->vectorizeMemoryInstruction() - handles only exotic
> > target-specific cases
> 

> > Pros:
> 
> > It will allow us to implement all X86 specific solutions.
> 
> > The expandload and compressstore intrinsics may be x86-specific and
> > polymorphic:
> 
> > llvm.x86.masked.expandload()
> 
> > llvm.x86.masked.compressstore()
> 

> > Cons:
> 

> > TTI will need to deal with Loop Info, SCEVs and other loop analysis
> > info that it does not have today. (I do not like this way)
> 
> > Or we'll need to introduce TLV - Target Loop Vectorizer - a new
> > class
> > that handles all target specific cases. This solution seems more
> > reasonable, but too heavy just for compress/expand.
> 
> > Do you see any other target plug-in solution?
> 

> > -Elena
> 

> > |
> 
> > |Thanks again,
> 
> > |Hal
> 
> > |
> 
> > | ----- Original Message -----
> 

> > |> From: "Elena Demikhovsky" < elena.demikhovsky at intel.com >
> 
> > |> To: "llvm-dev" < llvm-dev at lists.llvm.org >
> 
> > |> Cc: "Ayal Zaks" < ayal.zaks at intel.com >, "Michael Kuperstein"
> 
> > |< mkuper at google.com >, "Adam Nemet ( anemet at apple.com )"
> 
> > |> < anemet at apple.com >, "Hal Finkel ( hfinkel at anl.gov )"
> 
> > |< hfinkel at anl.gov >, "Sanjay Patel ( spatel at rotateright.com )"
> 
> > |> < spatel at rotateright.com >, "Nadav Rotem"
> 
> > |< nadav.rotem at me.com >
> 
> > |> Sent: Monday, September 19, 2016 1:37:02 AM
> 
> > |> Subject: RFC: New intrinsics masked.expandload and
> 
> > |> masked.compressstore
> 
> > |>
> 
> > |>
> 
> > |> Hi all,
> 
> > |>
> 
> > |> The AVX-512 ISA introduces new vector instructions, VCOMPRESS and
> > |> VEXPAND, to allow vectorization of the following loops with two
> > |> specific types of cross-iteration dependencies:
> 
> > |>
> 
> > |> Compress:
> 
> > |> for (int i=0; i<N; ++i)
> > |>   if (t[i])
> > |>     *A++ = expr;
> 
> > |>
> 
> > |> Expand:
> 
> > |> for (int i=0; i<N; ++i)
> > |>   if (t[i])
> > |>     X[i] = *A++;
> > |>   else
> > |>     X[i] = PassThruV[i];
> 
> > |>
> 
> > |> This poster (
> > |> http://llvm.org/devmtg/2013-11/slides/Demikhovsky-Poster.pdf )
> > |> depicts the “compress” and “expand” patterns.
> 
> > |>
> 
> > |> The RFC proposes to support this functionality by introducing
> > |> two
> 
> > |> intrinsics to LLVM IR:
> 
> > |> llvm.masked.expandload.*
> 
> > |> llvm.masked.compressstore.*
> 
> > |>
> 
> > |> The syntax of these two intrinsics is similar to the syntax of
> 
> > |> llvm.masked.load.* and masked.store.*, respectively, but the
> 
> > |semantics
> 
> > |> are different, matching the above patterns.
> 
> > |>
> 
> > |> %res = call <16 x float> @llvm.masked.expandload.v16f32.p0f32 (float*
> > |> %ptr, <16 x i1> %mask, <16 x float> %passthru)
> 
> > |> void @llvm.masked.compressstore.v16f32.p0f32 (<16 x float> <value>,
> > |> float* <ptr>, <16 x i1> <mask>)
> 
> > |>
> 
> > |> The arguments %mask, %value and %passthru all have the same vector
> > |> length.
> 
> > |> The underlying type of %ptr corresponds to the scalar type of the
> > |> vector value.
> 
> > |> (In brief; the full syntax description will be provided in
> > |> subsequent full documentation.)
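
A lane-by-lane sketch of the semantics implied by the Compress and Expand loops and the signatures above may help here. It is plain C++ rather than IR, the 4-lane width and the helper activeLanes are assumptions made for the example, and the full documentation promised in the RFC remains the authority on details.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t VL = 4;  // illustrative vector length, not from the RFC
using Vec = std::array<float, VL>;
using Mask = std::array<bool, VL>;

// Models masked.expandload: enabled lanes receive consecutive elements read
// from ptr; disabled lanes keep the corresponding passthru element.
Vec expandload(const float *ptr, const Mask &mask, const Vec &passthru) {
  Vec res = passthru;
  for (std::size_t lane = 0; lane < VL; ++lane)
    if (mask[lane]) res[lane] = *ptr++;
  return res;
}

// Models masked.compressstore: enabled lanes of value are written to
// consecutive locations starting at ptr; disabled lanes are skipped.
void compressstore(const Vec &value, float *ptr, const Mask &mask) {
  for (std::size_t lane = 0; lane < VL; ++lane)
    if (mask[lane]) *ptr++ = value[lane];
}

// Hypothetical helper: the number of enabled lanes, which is how far the
// pointer A from the scalar loops advances per vector iteration.
std::size_t activeLanes(const Mask &mask) {
  std::size_t n = 0;
  for (bool b : mask) n += b ? 1 : 0;
  return n;
}
```

With this model, a vectorized Compress loop would call compressstore once per group of VL iterations and then advance A by activeLanes(mask); the Expand loop would do the same with expandload. The only value carried across iterations is A, and its update depends only on the mask.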
> 
> > |>
> 
> > |> The intrinsics are planned to be target independent, similar to
> > |> masked.load/store/gather/scatter. They will be lowered efficiently
> > |> on AVX-512 and scalarized on other targets, also akin to the
> > |> masked.* intrinsics.
> 
> > |> The loop vectorizer will query TTI about the existence of efficient
> > |> support for these intrinsics and, if it is provided, will be able to
> > |> handle loops with such cross-iteration dependences.
> 
> > |>
> 
> > |> The first step will include the full documentation and the
> > |> implementation of the CodeGen part.
> 
> > |>
> 
> > |> Additional information about expand load (
> > |> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=expandload&techs=AVX_512
> > |> ) and compress store (
> > |> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=compressstore&techs=AVX_512
> > |> ) can also be found in the Intel Intrinsics Guide.
> 
> > |>
> 
> > |>
> 
> > |> * Elena
> 
> > |>
> 
> > |> ---------------------------------------------------------------------
> 
> > |> Intel Israel (74) Limited
> 
> > |>
> 
> > |> This e-mail and any attachments may contain confidential
> > |> material
> 
> > |for
> 
> > |> the sole use of the intended recipient(s). Any review or
> > |> distribution
> 
> > |> by others is strictly prohibited. If you are not the intended
> 
> > |> recipient, please contact the sender and delete all copies.
> 
> > |
> 
> > |--
> 
> > |Hal Finkel
> 
> > |Lead, Compiler Technology and Programming Languages
> 
> > |Leadership Computing Facility
> 
> > |Argonne National Laboratory
> 
> 

-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 