<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: arial,helvetica,sans-serif; font-size: 10pt; color: #000000'><br><hr id="zwchr"><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px; color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><b>From: </b>"Michael Kuperstein" <mkuper@google.com><br><b>To: </b>"Elena Demikhovsky" <elena.demikhovsky@intel.com><br><b>Cc: </b>"Hal Finkel" <hfinkel@anl.gov>, "Ayal Zaks" <ayal.zaks@intel.com>, "Adam Nemet (anemet@apple.com)" <anemet@apple.com>, "Sanjay Patel (spatel@rotateright.com)" <spatel@rotateright.com>, "Nadav Rotem" <nadav.rotem@me.com>, "llvm-dev" <llvm-dev@lists.llvm.org><br><b>Sent: </b>Monday, September 26, 2016 2:31:41 AM<br><b>Subject: </b>Re: RFC: New intrinsics masked.expandload and masked.compressstore<br><br><div dir="ltr">In theory, we could offload several things to such a target plug-in, I'm just not entirely sure we want to.<div><br></div><div>Two examples I can think of:</div><div><br></div><div>1) This could be a better interface for masked load/stores and gathers.</div><div><br></div><div id="DWT10089">2) Horizontal reductions. I tried writing yet-another-horizontals-as-first-class-citizens proposal a couple of months ago, and the main problem from the previous discussions about this was that there's no good common representation. E.g. should a horizontal add return a vector or a scalar, should it return the base type of the vector (assumes saturation) or a wider integer type, etc. With a plugin, we could have the vectorizer emit the right target intrinsic, instead of the crazy backend pattern-matching we have now. </div></div></blockquote>I don't think we want to offload either of these things to the targets to produce target-specific intrinsics - both are fairly generic. 
There's value in using IR and then pattern-matching the result later because it also means that we pick up cases where the same pattern comes from people using C-level vector intrinsics, other portable frontends, etc. We don't want every frontend wishing to emit a horizontal reduction to need to use target-specific intrinsics for different targets. Our vectorizer should not be special in this regard.<br><br>However, this does bring up another issue with our current cost model: it currently estimates costs one instruction at a time, and so can't take advantage of lower costs associated with target instructions that have complicated behaviors (FMAs, saturating arithmetic, byte-swapping loads, etc.). This is a separate problem, in a sense, but perhaps there's a common solution.<br><br> -Hal<br><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px; color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><div dir="ltr"><div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Sep 25, 2016 at 9:28 PM, Demikhovsky, Elena <span dir="ltr"><<a href="mailto:elena.demikhovsky@intel.com" target="_blank">elena.demikhovsky@intel.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><span class=""><br>

  |<br>

  |Hi Elena,<br>

  |<br>

  |Technically speaking, this seems straightforward.<br>

  |<br>

  |I wonder, however, how target-independent this is in a practical<br>

  |sense; will there be an efficient lowering when targeting any other<br>

  |ISA? I don't want to get into the territory where, because the<br>

  |vectorizer is supposed to be architecture independent, we need to<br>

  |add target-independent intrinsics for all potentially-side-effect-<br>

  |carrying idioms (or just complicated idioms) we want the vectorizer to<br>

  |support on any target. Is there a way we can design the vectorizer so<br>

  |that the targets can plug in their own idiom recognition for these<br>

  |kinds of things, and then, via that interface, let the vectorizer produce<br>

  |the relevant target-dependent intrinsics?<br>

<br>

</span>Introducing a target-specific plug-in into the vectorizer may be a good idea. We need target-specific pattern recognition and a target-specific implementation of “vectorizeMemoryInstruction”. (More functionality may be needed in the future.)<br>

TTI->checkAdditionalVectorizationOpportunities() - detects target-specific patterns; X86 will find compress/expand and maybe others<br>

TTI->vectorizeMemoryInstruction()  - handles only exotic target-specific cases<br>

<br>

Pros:<br>

It will allow us to implement all X86-specific solutions.<br>

The expandload and compressstore intrinsics could then be x86-specific and polymorphic:<br>

llvm.x86.masked.expandload()<br>

llvm.x86.masked.compressstore()<br>

<br>

Cons:<br>

<br>

TTI will need to deal with Loop Info, SCEVs and other loop analysis info that it does not have today. (I do not like this approach.)<br>

Alternatively, we would need to introduce a TLV (Target Loop Vectorizer), a new class that handles all target-specific cases. This solution seems more reasonable, but it is too heavy just for compress/expand.<br>

Do you see any other target plug-in solution?<br>

<span class="HOEnZb"><font color="#888888"><br>

-Elena<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

  |<br>

  |Thanks again,<br>

  |Hal<br>

  |<br>

  |<hr id="zwchr"><br>

  |> From: "Elena Demikhovsky" <<a href="mailto:elena.demikhovsky@intel.com" target="_blank">elena.demikhovsky@intel.com</a>><br>

  |> To: "llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>><br>

  |> Cc: "Ayal Zaks" <<a href="mailto:ayal.zaks@intel.com" target="_blank">ayal.zaks@intel.com</a>>, "Michael Kuperstein"<br>

  |<<a href="mailto:mkuper@google.com" target="_blank">mkuper@google.com</a>>, "Adam Nemet (<a href="mailto:anemet@apple.com" target="_blank">anemet@apple.com</a>)"<br>

  |> <<a href="mailto:anemet@apple.com" target="_blank">anemet@apple.com</a>>, "Hal Finkel (<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>)"<br>

  |<<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>, "Sanjay Patel (<a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>)"<br>

  |> <<a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>>, "Nadav Rotem"<br>

  |<<a href="mailto:nadav.rotem@me.com" target="_blank">nadav.rotem@me.com</a>><br>

  |> Sent: Monday, September 19, 2016 1:37:02 AM<br>

  |> Subject: RFC: New intrinsics masked.expandload and<br>

  |> masked.compressstore<br>

  |><br>

  |><br>

  |> Hi all,<br>

  |><br>

  |> The AVX-512 ISA introduces new vector instructions, VCOMPRESS and<br>

  |> VEXPAND, that allow vectorization of the following loops with two<br>

  |> specific types of cross-iteration dependencies:<br>

  |><br>

  |> Compress:<br>

  |> for (int i=0; i<N; ++i)<br>

  |> if (t[i])<br>

  |> *A++ = expr;<br>

  |><br>

  |> Expand:<br>

  |> for (int i=0; i<N; ++i)<br>

  |> if (t[i])<br>

  |> X[i] = *A++;<br>

  |> else<br>

  |> X[i] = PassThruV[i];<br>

  |><br>
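The two loops above can be written out as compilable C to make the cross-iteration dependency on the pointer explicit. This is only a sketch: the function names, the `src` array standing in for "expr", and the returned element counts are assumptions added for illustration and are not part of the RFC.<br>

```c
#include <stddef.h>

/* Compress: for every i whose predicate t[i] is set, store the value
   contiguously at the running output pointer. `src[i]` stands in for
   "expr" (an assumption). Returns the number of elements written. */
size_t compress_loop(const int *t, const float *src, size_t n, float *a)
{
    float *out = a;
    for (size_t i = 0; i < n; ++i)
        if (t[i])
            *out++ = src[i];   /* pointer advances only on enabled iterations */
    return (size_t)(out - a);
}

/* Expand: enabled iterations consume the next contiguous element of `a`;
   disabled iterations take the pass-through value instead.
   Returns the number of elements read from `a`. */
size_t expand_loop(const int *t, const float *a, const float *pass_thru,
                   size_t n, float *x)
{
    const float *in = a;
    for (size_t i = 0; i < n; ++i) {
        if (t[i])
            x[i] = *in++;
        else
            x[i] = pass_thru[i];
    }
    return (size_t)(in - a);
}
```

In both loops the position reached by the pointer in iteration i depends on how many earlier predicates were true, which is the dependency that ordinary masked loads and stores cannot express.<br>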

  |> This poster (<br>

  |> <a href="http://llvm.org/devmtg/2013-11/slides/Demikhovsky-Poster.pdf" rel="noreferrer" target="_blank">http://llvm.org/devmtg/2013-11/slides/Demikhovsky-Poster.pdf</a> )<br>

  |> depicts the “compress” and “expand” patterns.<br>

  |><br>

  |> The RFC proposes to support this functionality by introducing two<br>

  |> intrinsics to LLVM IR:<br>

  |> llvm.masked.expandload.*<br>

  |> llvm.masked.compressstore.*<br>

  |><br>

  |> The syntax of these two intrinsics is similar to the syntax of<br>

  |> llvm.masked.load.* and llvm.masked.store.*, respectively, but the<br>

  |> semantics are different, matching the above patterns.<br>

  |><br>

  |> %res = call <16 x float> @llvm.masked.expandload.v16f32.p0f32 (float* %ptr, <16 x i1> %mask, <16 x float> %passthru)<br>

  |> void @llvm.masked.compressstore.v16f32.p0f32 (<16 x float> <value>, float* <ptr>, <16 x i1> <mask>)<br>

  |><br>

  |> The arguments %mask, %value and %passthru all have the same vector<br>

  |> length.<br>

  |> The underlying type of %ptr corresponds to the scalar type of the<br>

  |> vector value.<br>

  |> (In brief; the full syntax description will be provided in subsequent<br>

  |> full documentation.)<br>

  |><br>

  |> The intrinsics are planned to be target independent, similar to<br>

  |> masked.load/store/gather/scatter. They will be lowered efficiently on<br>

  |> AVX-512 and scalarized on other targets, also akin to the masked.*<br>

  |> intrinsics.<br>

  |> The loop vectorizer will query TTI about the existence of efficient<br>

  |> support for these intrinsics and, if it is provided, will be able to<br>

  |> handle loops with such cross-iteration dependences.<br>

  |><br>
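To illustrate what "scalarized on other targets" could mean, here is a scalar reference model in C of the two proposed intrinsics for the 16-lane example, plus a chunked loop that uses it. It is a sketch under assumptions: the function names, the byte-per-lane mask representation and the requirement that the trip count is a multiple of the vector length are illustrative choices, not something the RFC specifies.<br>

```c
#include <stddef.h>

#define VL 16  /* lane count, matching the <16 x float> example */

/* Model of masked.expandload: consecutive memory elements fill the
   enabled lanes in order; disabled lanes take the passthru value.
   Returns how many elements were read from ptr. */
size_t expandload_model(const float *ptr, const unsigned char mask[VL],
                        const float passthru[VL], float res[VL])
{
    size_t k = 0;
    for (size_t lane = 0; lane < VL; ++lane)
        res[lane] = mask[lane] ? ptr[k++] : passthru[lane];
    return k;
}

/* Model of masked.compressstore: enabled lanes are written to
   consecutive memory locations; disabled lanes are skipped.
   Returns how many elements were written to ptr. */
size_t compressstore_model(const float value[VL], float *ptr,
                           const unsigned char mask[VL])
{
    size_t k = 0;
    for (size_t lane = 0; lane < VL; ++lane)
        if (mask[lane])
            ptr[k++] = value[lane];
    return k;
}

/* Vectorized shape of the compress loop: one compressstore per VL-wide
   chunk, with the output pointer advanced by the number of enabled
   lanes. Assumes n is a multiple of VL (no remainder loop). */
size_t compress_vectorized(const unsigned char *t, const float *src,
                           size_t n, float *a)
{
    size_t written = 0;
    for (size_t i = 0; i + VL <= n; i += VL)
        written += compressstore_model(src + i, a + written, t + i);
    return written;
}
```

The only state carried between chunks is the running count of enabled lanes, so the cross-iteration dependency reduces to a per-chunk population count of the mask.<br>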

  |> The first step will include the full documentation and the<br>

  |> implementation of the CodeGen part.<br>

  |><br>

  |> Additional information about expand load (<br>

  |> <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=expandload&techs=AVX_512" rel="noreferrer" target="_blank">https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=expandload&techs=AVX_512</a><br>

  |> ) and compress store (<br>

  |> <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=compressstore&techs=AVX_512" rel="noreferrer" target="_blank">https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=compressstore&techs=AVX_512</a><br>

  |> ) can also be found in the Intel Intrinsics Guide.<br>

  |><br>

  |><br>

  |> - Elena<br>

  |><br>

  |> ---------------------------------------------------------------------<br>

  |> Intel Israel (74) Limited<br>

  |><br>

  |> This e-mail and any attachments may contain confidential material<br>

  |for<br>

  |> the sole use of the intended recipient(s). Any review or distribution<br>

  |> by others is strictly prohibited. If you are not the intended<br>

  |> recipient, please contact the sender and delete all copies.<br>

  |<br>

  |--<br>

  |Hal Finkel<br>

  |Lead, Compiler Technology and Programming Languages Leadership<br>

  |Computing Facility Argonne National Laboratory<br>


</div></div></blockquote></div><br></div>

</blockquote><br><br><br>-- <br><div><span name="x"></span>Hal Finkel<br>Lead, Compiler Technology and Programming Languages<br>Leadership Computing Facility<br>Argonne National Laboratory<span name="x"></span><br></div></div></body></html>