[llvm-dev] RFC: New intrinsics masked.expandload and masked.compressstore

Demikhovsky, Elena via llvm-dev llvm-dev at lists.llvm.org
Tue Sep 27 11:14:55 PDT 2016


Thank you. I have a BOF slot at the LLVM dev meeting where I'll open a discussion about intrinsics as a form of vectorizer output.

-  Elena


  |-----Original Message-----
  |From: Hal Finkel [mailto:hfinkel at anl.gov]
  |Sent: Tuesday, September 27, 2016 00:08
  |To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
  |Cc: Zaks, Ayal <ayal.zaks at intel.com>; Michael Kuperstein
  |<mkuper at google.com>; Adam Nemet (anemet at apple.com)
  |<anemet at apple.com>; Sanjay Patel (spatel at rotateright.com)
  |<spatel at rotateright.com>; Nadav Rotem <nadav.rotem at me.com>;
  |llvm-dev <llvm-dev at lists.llvm.org>
  |Subject: Re: RFC: New intrinsics masked.expandload and
  |masked.compressstore
  |
  |----- Original Message -----
  |> From: "Elena Demikhovsky" <elena.demikhovsky at intel.com>
  |> To: "Hal Finkel" <hfinkel at anl.gov>
  |> Cc: "Ayal Zaks" <ayal.zaks at intel.com>, "Michael Kuperstein"
  |<mkuper at google.com>, "Adam Nemet (anemet at apple.com)"
  |> <anemet at apple.com>, "Sanjay Patel (spatel at rotateright.com)"
  |<spatel at rotateright.com>, "Nadav Rotem"
  |> <nadav.rotem at me.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
  |> Sent: Monday, September 26, 2016 3:55:27 PM
  |> Subject: RE: RFC: New intrinsics masked.expandload and
  |> masked.compressstore
  |>
  |>
  |>   |
  |>   |How would this work in this case? The result would need to affect
  |>   |the legality and cost of the memory instruction. From your poster,
  |>   |it looks like we're talking about loops with constructs like this:
  |>   |
  |>   |for (i = 0; i < N; i++) {
  |>   |  if (topVal > b[i]) {
  |>   |    *dst = a[i];
  |>   |    dst++;
  |>   |  }
  |>   |}
  |>   |
  |>   |Is this loop vectorizable at all without these constructs?
  |>
  |> Good question. Today it isn't. Theoretically it could be, if we knew
  |> that only a small part of the loop has a cross-iteration dependency or
  |> another issue; a loop may be vectorized and still contain scalar
  |> pieces inside. But that requires a full reconstruction of the cost
  |> model.
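  |>
  |> For illustration, here is a rough sketch (illustrative only, not part
  |> of the RFC) of what the vectorized body of the loop above could look
  |> like, expressed with the AVX-512 C intrinsics rather than IR; the
  |> float element type, VF=16 and N being a multiple of 16 are
  |> assumptions:
  |>
  |> #include <immintrin.h>
  |>
  |> /* Sketch only: compress-store the selected elements of a[] into dst. */
  |> void compress_copy(float *dst, const float *a, const float *b,
  |>                    float topVal, int N) {
  |>   __m512 top = _mm512_set1_ps(topVal);
  |>   for (int i = 0; i < N; i += 16) {
  |>     __m512 bv = _mm512_loadu_ps(&b[i]);
  |>     /* lane j is active iff topVal > b[i+j] */
  |>     __mmask16 m = _mm512_cmp_ps_mask(top, bv, _CMP_GT_OQ);
  |>     __m512 av = _mm512_loadu_ps(&a[i]);
  |>     /* store only the active lanes, packed contiguously at dst */
  |>     _mm512_mask_compressstoreu_ps(dst, m, av);
  |>     /* advance dst by the number of elements actually stored */
  |>     dst += _mm_popcnt_u32((unsigned)m);
  |>   }
  |> }
  |>
  |> The proposed llvm.masked.compressstore intrinsic would be the IR-level
  |> counterpart of the _mm512_mask_compressstoreu_ps call above.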
  |>
  |>   | It looks like
  |>   |the target would need to analyze the PHI representing the store's
  |>   |address, assign the store some reasonable cost, and also provide
  |>   |some alternative SCEVs (perhaps lower and upper bounds) for use
  |>   |with the dependence checks?
  |>
  |> First of all, this loop has to pass the legality check. Legality will
  |> need additional effort in order to detect the compress/expand pattern
  |> in a loop with a cross-iteration dependency.
  |> Once the pattern is detected, we mark the store as a "compressing
  |> store" and TTI will give a cost for the compressing store.
  |>   |
  |>   |> X86 will find compress/expand and maybe others
  |>   |
  |>   |What others might fit in here?
  |>   |> Compress/expand are special patterns that will require a separate
  |>   |> analysis. I thought about other X86-specific patterns that might be
  |>   |> detected, such as strided memory access with masks or arithmetic
  |>   |> with saturation. But again, I'm not sure that constructing a
  |>   |> plug-in wouldn't be overkill in this case.
  |
  |I'm fairly certain that creating a plugin interface just for this would
  |be overkill. Nevertheless, I found this discussion quite helpful. If we
  |can't think of any other examples, I'm fine with this intrinsic as
  |proposed.
  |
  |Thanks again,
  |Hal
  |
  |>   |
  |>   |> TTI->vectorizeMemoryInstruction()  - handle only exotic
  |>   |> target-specific cases
  |>   |>
  |>   |> Pros:
  |>   |> It will allow us to implement all X86-specific solutions.
  |>   |> The expandload and compressstore intrinsics may be x86-specific,
  |>   |> polymorphic:
  |>   |> llvm.x86.masked.expandload()
  |>   |> llvm.x86.masked.compressstore()
  |>   |>
  |>   |> Cons:
  |>   |>
  |>   |> TTI will need to deal with LoopInfo, SCEVs and other loop analysis
  |>   |> info that it does not have today. (I do not like this approach.)
  |>   |
  |>   |Giving TTI the loop and other analyses, in itself, does not bother
  |>   |me. getUnrollingPreferences takes a Loop*. I'm more concerned about
  |>   |how cleanly we could integrate everything.
  |>   |
  |>   |> Or we'll need to introduce a TLV - Target Loop Vectorizer - a new
  |>   |> class that handles all target-specific cases. This solution seems
  |>   |> more reasonable, but too heavy just for compress/expand.
  |>   |
  |>   |I don't see how this would work without duplicating a lot of the
  |>   |logic in the vectorizer (unless it is really just doing loop-idiom
  |>   |recognition, in which case none of this is really relevant). You'd
  |>   |want the cost model used by the vectorizer, in general, to be
  |>   |integrated with whatever the target was providing.
  |>   |
  |>   |Thanks again,
  |>   |Hal
  |>   |
  |>   |> Do you see any other target plug-in solution?
  |>   |>
  |>   |> -Elena
  |>   |>
  |>   |>   |
  |>   |>   |Thanks again,
  |>   |>   |Hal
  |>   |>   |
  |>   |>   |----- Original Message -----
  |>   |>   |> From: "Elena Demikhovsky"
  |<elena.demikhovsky at intel.com>
  |>   |>   |> To: "llvm-dev" <llvm-dev at lists.llvm.org>
  |>   |>   |> Cc: "Ayal Zaks" <ayal.zaks at intel.com>, "Michael
  |Kuperstein"
  |>   |>   |<mkuper at google.com>, "Adam Nemet
  |(anemet at apple.com)"
  |>   |>   |> <anemet at apple.com>, "Hal Finkel (hfinkel at anl.gov)"
  |>   |>   |<hfinkel at anl.gov>, "Sanjay Patel (spatel at rotateright.com)"
  |>   |>   |> <spatel at rotateright.com>, "Nadav Rotem"
  |>   |>   |<nadav.rotem at me.com>
  |>   |>   |> Sent: Monday, September 19, 2016 1:37:02 AM
  |>   |>   |> Subject: RFC: New intrinsics masked.expandload and
  |>   |>   |> masked.compressstore
  |>   |>   |>
  |>   |>   |>
  |>   |>   |> Hi all,
  |>   |>   |>
  |>   |>   |> The AVX-512 ISA introduces the new vector instructions
  |>   |>   |> VCOMPRESS and VEXPAND in order to allow vectorization of the
  |>   |>   |> following loops with two specific types of cross-iteration
  |>   |>   |> dependencies:
  |>   |>   |>
  |>   |>   |> Compress:
  |>   |>   |> for (int i = 0; i < N; ++i)
  |>   |>   |>   if (t[i])
  |>   |>   |>     *A++ = expr;
  |>   |>   |>
  |>   |>   |> Expand:
  |>   |>   |> for (int i = 0; i < N; ++i)
  |>   |>   |>   if (t[i])
  |>   |>   |>     X[i] = *A++;
  |>   |>   |>   else
  |>   |>   |>     X[i] = PassThruV[i];
  |>   |>   |>
  |>   |>   |> On this poster
  |>   |>   |> ( http://llvm.org/devmtg/2013-11/slides/Demikhovsky-Poster.pdf )
  |>   |>   |> you’ll find the “compress” and “expand” patterns depicted.
  |>   |>   |>
  |>   |>   |> The RFC proposes to support this functionality by
  |>   |>   |> introducing
  |>   |>   |> two
  |>   |>   |> intrinsics to LLVM IR:
  |>   |>   |> llvm.masked.expandload.*
  |>   |>   |> llvm.masked.compressstore.*
  |>   |>   |>
  |>   |>   |> The syntax of these two intrinsics is similar to the syntax of
  |>   |>   |> llvm.masked.load.* and masked.store.*, respectively, but the
  |>   |>   |> semantics are different, matching the above patterns.
  |>   |>   |>
  |>   |>   |> %res = call <16 x float> @llvm.masked.expandload.v16f32.p0f32
  |>   |>   |>     (float* %ptr, <16 x i1> %mask, <16 x float> %passthru)
  |>   |>   |>
  |>   |>   |> void @llvm.masked.compressstore.v16f32.p0f32
  |>   |>   |>     (<16 x float> <value>, float* <ptr>, <16 x i1> <mask>)
  |>   |>   |>
  |>   |>   |> The arguments %mask, %value and %passthru all have the same
  |>   |>   |> vector length. The underlying type of %ptr corresponds to the
  |>   |>   |> scalar type of the vector value.
  |>   |>   |> (In brief; the full syntax description will be provided in the
  |>   |>   |> subsequent full documentation.)
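  |>   |>   |>
  |>   |>   |> To make the intended semantics concrete, here is a scalar C
  |>   |>   |> model (an illustrative sketch only; the helper names and the
  |>   |>   |> fixed <16 x float> shape are assumptions, not the final
  |>   |>   |> definition). Only masked-in lanes touch memory, and the
  |>   |>   |> pointer is consumed contiguously, one element per set mask
  |>   |>   |> bit:
  |>   |>   |>
  |>   |>   |> /* compressstore: consecutive stores of the masked-in lanes */
  |>   |>   |> void compressstore_v16f32(const float val[16], float *ptr,
  |>   |>   |>                           const _Bool mask[16]) {
  |>   |>   |>   for (int i = 0; i < 16; ++i)
  |>   |>   |>     if (mask[i])
  |>   |>   |>       *ptr++ = val[i];
  |>   |>   |> }
  |>   |>   |>
  |>   |>   |> /* expandload: consecutive loads for set bits, passthru else */
  |>   |>   |> void expandload_v16f32(float res[16], const float *ptr,
  |>   |>   |>                        const _Bool mask[16],
  |>   |>   |>                        const float passthru[16]) {
  |>   |>   |>   for (int i = 0; i < 16; ++i)
  |>   |>   |>     res[i] = mask[i] ? *ptr++ : passthru[i];
  |>   |>   |> }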
  |>   |>   |>
  |>   |>   |> The intrinsics are planned to be target independent, similar
  |>   |>   |> to masked.load/store/gather/scatter. They will be lowered
  |>   |>   |> effectively on AVX-512 and scalarized on other targets, also
  |>   |>   |> akin to the masked.* intrinsics.
  |>   |>   |> The loop vectorizer will query TTI about the existence of
  |>   |>   |> effective support for these intrinsics and, if such support is
  |>   |>   |> provided, will be able to handle loops with such
  |>   |>   |> cross-iteration dependences.
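  |>   |>   |>
  |>   |>   |> For the Expand pattern above, the effective lowering on
  |>   |>   |> AVX-512 corresponds, roughly, to the following C-intrinsics
  |>   |>   |> sketch (illustrative only; float data, t[] held as 32-bit
  |>   |>   |> ints and N being a multiple of 16 are assumptions):
  |>   |>   |>
  |>   |>   |> #include <immintrin.h>
  |>   |>   |>
  |>   |>   |> void expand_copy(float *X, const float *A, const int *t,
  |>   |>   |>                  const float *PassThruV, int N) {
  |>   |>   |>   for (int i = 0; i < N; i += 16) {
  |>   |>   |>     __m512i tv = _mm512_loadu_si512(&t[i]);
  |>   |>   |>     /* lane j is active iff t[i+j] != 0 */
  |>   |>   |>     __mmask16 m = _mm512_test_epi32_mask(tv, tv);
  |>   |>   |>     __m512 pass = _mm512_loadu_ps(&PassThruV[i]);
  |>   |>   |>     /* one consecutive load from A per active lane,
  |>   |>   |>        passthru elements elsewhere */
  |>   |>   |>     __m512 v = _mm512_mask_expandloadu_ps(pass, m, A);
  |>   |>   |>     _mm512_storeu_ps(&X[i], v);
  |>   |>   |>     /* A advances by the number of active lanes */
  |>   |>   |>     A += _mm_popcnt_u32((unsigned)m);
  |>   |>   |>   }
  |>   |>   |> }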
  |>   |>   |>
  |>   |>   |> The first step will include the full documentation and the
  |>   |>   |> implementation of the CodeGen part.
  |>   |>   |>
  |>   |>   |> Additional information about expand load
  |>   |>   |> ( https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=expandload&techs=AVX_512 )
  |>   |>   |> and compress store
  |>   |>   |> ( https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=compressstore&techs=AVX_512 )
  |>   |>   |> can also be found in the Intel Intrinsics Guide.
  |>   |>   |>
  |>   |>   |>
  |>   |>   |>     * Elena
  |>   |>   |>
  |>   |>   |
  |>   |>   |--
  |>   |>   |Hal Finkel
  |>   |>   |Lead, Compiler Technology and Programming Languages
  |>   |>   |Leadership Computing Facility
  |>   |>   |Argonne National Laboratory
  |>   |
  |>   |--
  |>   |Hal Finkel
  |>   |Lead, Compiler Technology and Programming Languages
  |>   |Leadership Computing Facility
  |>   |Argonne National Laboratory
  |
  |--
  |Hal Finkel
  |Lead, Compiler Technology and Programming Languages
  |Leadership Computing Facility
  |Argonne National Laboratory
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

