[llvm-dev] RFC: New intrinsics masked.expandload and masked.compressstore

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Mon Sep 26 14:08:14 PDT 2016


----- Original Message -----
> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Ayal Zaks" <ayal.zaks at intel.com>, "Michael Kuperstein" <mkuper at google.com>, "Adam Nemet (anemet at apple.com)"
> <anemet at apple.com>, "Sanjay Patel (spatel at rotateright.com)" <spatel at rotateright.com>, "Nadav Rotem"
> <nadav.rotem at me.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Monday, September 26, 2016 3:55:27 PM
> Subject: RE: RFC: New intrinsics masked.expandload and masked.compressstore
> 
> 
>   |
>   |How would this work in this case? The result would need to affect the
>   |legality and cost of the memory instruction. From your poster, it looks
>   |like we're talking about loops with constructs like this:
>   |
>   |for (i = 0; i < N; i++) {
>   |  if (topVal > b[i]) {
>   |    *dst = a[i];
>   |    dst++;
>   |  }
>   |}
>   |
>   |Is this loop vectorizable at all without these constructs?
> 
> Good question. Today it isn't. Theoretically it could be, if we knew
> that only a small part of the loop has a cross-iteration dependency or
> another issue; a loop may be vectorized and still contain scalar pieces
> inside. But that would require a full reconstruction of the cost model.
> 
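> For concreteness, here is a rough hand-vectorized sketch of the loop
> above using the AVX-512 compress-store intrinsic from the Intel
> Intrinsics Guide (scalar tail omitted; illustrative only, not the exact
> code the vectorizer would emit). This is what effective support for a
> "compressing store" buys us:
> 
> #include <immintrin.h>
> 
> // Pack the selected elements of a[] into dst, 16 floats per iteration.
> // Assumes N is a multiple of 16 for brevity; a real version needs a
> // scalar tail loop.
> float *compress_dst(const float *a, const float *b, float topVal,
>                     float *dst, int N) {
>   for (int i = 0; i < N; i += 16) {
>     __m512 va = _mm512_loadu_ps(a + i);
>     __m512 vb = _mm512_loadu_ps(b + i);
>     __mmask16 m =
>         _mm512_cmp_ps_mask(_mm512_set1_ps(topVal), vb, _CMP_GT_OQ);
>     _mm512_mask_compressstoreu_ps(dst, m, va); // store only selected lanes, packed
>     dst += _mm_popcnt_u32(m);                  // advance by the number of stored elements
>   }
>   return dst;
> }
> 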
>   |It looks like the target would need to analyze the PHI representing
>   |the store's address, assign the store some reasonable cost, and also
>   |provide some alternative SCEVs (perhaps lower and upper bounds) for
>   |use with the dependence checks?
> 
> First of all, this loop has to pass the legality check. Legality will
> need additional effort in order to detect the compress/expand pattern
> in a loop with a cross-iteration dependency.
> Once the pattern is detected, we mark the store as a "compressing
> store" and TTI will give a cost for the compressing store.
>   |
>   |> X86 will find compress/expand, and maybe others
>   |
>   |What others might fit in here?
> Compress/expand are special patterns that will require a separate
> analysis. I thought about other X86-specific patterns that might be
> detected: strided memory access with masks, or arithmetic with
> saturation. But again, I'm not sure that constructing a plug-in
> wouldn't be overkill in this case.

I'm fairly certain that creating a plugin interface just for this would be overkill. Nevertheless, I found this discussion quite helpful. If we can't think of any other examples, I'm fine with these intrinsics as proposed.

Thanks again,
Hal

>   |
>   |> TTI->vectorizeMemoryInstruction()  - handle only exotic
>   |> target-specific cases
>   |>
>   |> Pros:
>   |> It will allow us to implement all X86-specific solutions.
>   |> The expandload and compressstore intrinsics may be x86-specific,
>   |> polymorphic:
>   |> llvm.x86.masked.expandload()
>   |> llvm.x86.masked.compressstore()
>   |>
>   |> Cons:
>   |>
>   |> TTI will need to deal with Loop Info, SCEVs and other loop analysis
>   |> info that it does not have today. (I do not like this approach.)
>   |
>   |Giving TTI the loop and other analyses, in itself, does not bother me.
>   |getUnrollingPreferences takes a Loop*. I'm more concerned about
>   |how cleanly we could integrate everything.
>   |
>   |> Or we'll need to introduce TLV - Target Loop Vectorizer - a new
>   |> class that handles all target-specific cases. This solution seems
>   |> more reasonable, but too heavy just for compress/expand.
>   |
>   |I don't see how this would work without duplicating a lot of the logic
>   |in the vectorizer (unless it is really just doing loop-idiom recognition,
>   |in which case none of this is really relevant). You'd want the cost
>   |model used by the vectorizer, in general, to be integrated with
>   |whatever the target was providing.
>   |
>   |Thanks again,
>   |Hal
>   |
>   |> Do you see any other target plug-in solution?
>   |>
>   |> -Elena
>   |>
>   |>   |
>   |>   |Thanks again,
>   |>   |Hal
>   |>   |
>   |>   |----- Original Message -----
>   |>   |> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com>
>   |>   |> To: "llvm-dev" <llvm-dev at lists.llvm.org>
>   |>   |> Cc: "Ayal Zaks" <ayal.zaks at intel.com>, "Michael Kuperstein"
>   |>   |<mkuper at google.com>, "Adam Nemet (anemet at apple.com)"
>   |>   |> <anemet at apple.com>, "Hal Finkel (hfinkel at anl.gov)"
>   |>   |<hfinkel at anl.gov>, "Sanjay Patel (spatel at rotateright.com)"
>   |>   |> <spatel at rotateright.com>, "Nadav Rotem"
>   |>   |<nadav.rotem at me.com>
>   |>   |> Sent: Monday, September 19, 2016 1:37:02 AM
>   |>   |> Subject: RFC: New intrinsics masked.expandload and
>   |>   |> masked.compressstore
>   |>   |>
>   |>   |>
>   |>   |> Hi all,
>   |>   |>
>   |>   |> AVX-512 ISA introduces new vector instructions VCOMPRESS and
>   |>   |> VEXPAND in order to allow vectorization of the following loops
>   |>   |> with two specific types of cross-iteration dependencies:
>   |>   |>
>   |>   |> Compress:
>   |>   |> for (int i = 0; i < N; ++i)
>   |>   |>   if (t[i])
>   |>   |>     *A++ = expr;
>   |>   |>
>   |>   |> Expand:
>   |>   |> for (i = 0; i < N; ++i)
>   |>   |>   if (t[i])
>   |>   |>     X[i] = *A++;
>   |>   |>   else
>   |>   |>     X[i] = PassThruV[i];
>   |>   |>
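>   |>   |> As a rough hand-written illustration (not part of the proposal;
>   |>   |> N is assumed to be a multiple of 16 and t[] is assumed to be
>   |>   |> available as bytes), the Expand pattern maps onto the AVX-512
>   |>   |> expand-load intrinsic from the Intel Intrinsics Guide like this:
>   |>   |>
>   |>   |> #include <immintrin.h>
>   |>   |>
>   |>   |> // Consecutive elements of *A are distributed into the enabled
>   |>   |> // lanes of X; disabled lanes take the pass-through value.
>   |>   |> const float *expand(const float *A, const unsigned char *t,
>   |>   |>                     const float *PassThruV, float *X, int N) {
>   |>   |>   for (int i = 0; i < N; i += 16) {
>   |>   |>     __mmask16 m = 0;
>   |>   |>     for (int j = 0; j < 16; ++j)   // build the lane mask from t[]
>   |>   |>       m |= (__mmask16)(t[i + j] ? 1 : 0) << j;
>   |>   |>     __m512 pass = _mm512_loadu_ps(PassThruV + i);
>   |>   |>     __m512 v = _mm512_mask_expandloadu_ps(pass, m, A);
>   |>   |>     _mm512_storeu_ps(X + i, v);
>   |>   |>     A += _mm_popcnt_u32(m);        // A advances only by the number of enabled lanes
>   |>   |>   }
>   |>   |>   return A;
>   |>   |> }
>   |>   |>
>   |>   |> The point of the proposed intrinsics is to let the loop vectorizer
>   |>   |> produce this form automatically instead of leaving the loop scalar.
>   |>   |>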
>   |>   |> On this poster
>   |>   |> ( http://llvm.org/devmtg/2013-11/slides/Demikhovsky-Poster.pdf )
>   |>   |> you’ll find the “compress” and “expand” patterns depicted.
>   |>   |>
>   |>   |> The RFC proposes to support this functionality by introducing
>   |>   |> two intrinsics to LLVM IR:
>   |>   |> llvm.masked.expandload.*
>   |>   |> llvm.masked.compressstore.*
>   |>   |>
>   |>   |> The syntax of these two intrinsics is similar to the syntax of
>   |>   |> llvm.masked.load.* and masked.store.*, respectively, but the
>   |>   |> semantics are different, matching the above patterns.
>   |>   |>
>   |>   |> %res = call <16 x float> @llvm.masked.expandload.v16f32.p0f32
>   |>   |>     (float* %ptr, <16 x i1> %mask, <16 x float> %passthru)
>   |>   |> void @llvm.masked.compressstore.v16f32.p0f32
>   |>   |>     (<16 x float> <value>, float* <ptr>, <16 x i1> <mask>)
>   |>   |>
>   |>   |> The arguments %mask, %value and %passthru all have the same
>   |>   |> vector length. The underlying type of %ptr corresponds to the
>   |>   |> scalar type of the vector value. (In brief; the full syntax
>   |>   |> description will be provided in subsequent full documentation.)
>   |>   |>
>   |>   |> The intrinsics are planned to be target independent, similar to
>   |>   |> masked.load/store/gather/scatter. They will be lowered effectively
>   |>   |> on AVX-512 and scalarized on other targets, also akin to the
>   |>   |> masked.* intrinsics.
>   |>   |> The loop vectorizer will query TTI about the existence of
>   |>   |> effective support for these intrinsics, and if it is provided,
>   |>   |> will be able to handle loops with such cross-iteration dependences.
>   |>   |>
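>   |>   |> To make that TTI interaction concrete, the query could look
>   |>   |> roughly like the sketch below, in the style of the existing
>   |>   |> isLegalMaskedLoad/isLegalMaskedGather hooks (the names and
>   |>   |> signatures here are illustrative only, not part of this proposal):
>   |>   |>
>   |>   |> // Hypothetical legality hooks; a target returns true when it can
>   |>   |> // lower the intrinsic to something better than a scalarized loop.
>   |>   |> class Type;
>   |>   |>
>   |>   |> struct TargetExpandCompressInfoSketch {
>   |>   |>   virtual bool isLegalMaskedExpandLoad(Type *DataTy) const {
>   |>   |>     return false;
>   |>   |>   }
>   |>   |>   virtual bool isLegalMaskedCompressStore(Type *DataTy) const {
>   |>   |>     return false;
>   |>   |>   }
>   |>   |>   virtual ~TargetExpandCompressInfoSketch() = default;
>   |>   |> };
>   |>   |>
>   |>   |> The vectorizer would consult such queries during its legality and
>   |>   |> cost analysis and fall back to its current (scalar) handling of
>   |>   |> these loops when they return false.
>   |>   |>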
>   |>   |> The first step will include the full documentation and the
>   |>   |> implementation of the CodeGen part.
>   |>   |>
>   |>   |> Additional information about expand load
>   |>   |> ( https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=expandload&techs=AVX_512 )
>   |>   |> and compress store
>   |>   |> ( https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=compressstore&techs=AVX_512 )
>   |>   |> can also be found in the Intel Intrinsics Guide.
>   |>   |>
>   |>   |>
>   |>   |>     * Elena
>   |>   |>
>   |>   |
>   |>
>   |
> 

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

