<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman",serif;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

span.EmailStyle17

        {mso-style-type:personal-reply;

        font-family:"Calibri",sans-serif;

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-family:"Calibri",sans-serif;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:72.0pt 90.0pt 72.0pt 90.0pt;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

</head>

<body lang="EN-US" link="blue" vlink="purple">

<div class="WordSection1">

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">></span> In theory, it should also allow the auto-vectorizers to produce more efficient prologue/epilog loop code, but that has not been implemented yet AFAIK.<o:p></o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Prologues and epilogues tend to use only one end of a vector, right? OTOH, checking both ends should help relax the “all lanes” specialization done in “Predicate Vectors If You Must”. :)<o:p></o:p></span></p>

<p class="MsoNormal"><a name="_MailEndCompose"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></a></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Note that if, in addition to the conditions below, the address is adequately aligned, condition “</span>1. both ends of vector are used<span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">”

 can be replaced by “Some lane of the vector is used”.<o:p></o:p></span></p>
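<p class="MsoNormal">To illustrate why alignment can relax condition 1, here is a minimal Python model (not LLVM code). It assumes a hypothetical target with 4096-byte pages and a 16-byte vector of four 4-byte lanes, and every name in it is made up for illustration. It checks by brute force that a full vector load touches no page beyond those the used lanes already touch, either when both end lanes are used at any address, or when some lane is used at a vector-aligned address.</p>

<pre>
```python
from itertools import product

PAGE_BYTES = 4096    # assumed memory-protection granularity (hypothetical target)
LANE_BYTES = 4       # one i32 lane
LANES = 4            # a 4 x i32 vector, 16 bytes in total


def pages_touched(addr, nbytes):
    """Indices of the pages covered by the nbytes bytes starting at addr."""
    first = addr // PAGE_BYTES
    last = (addr + nbytes - 1) // PAGE_BYTES
    return set(range(first, last + 1))


def widening_adds_no_page(addr, mask):
    """True if a full vector load stays within the pages of the used lanes."""
    requested = set()
    for lane, used in enumerate(mask):
        if used:
            requested |= pages_touched(addr + lane * LANE_BYTES, LANE_BYTES)
    return pages_touched(addr, LANES * LANE_BYTES).issubset(requested)


def both_ends(mask):
    """Condition 1 from the thread: first and last lanes are both used."""
    return mask[0] and mask[-1]


def holds_for_all(addresses, mask_ok):
    """Check every address against every mask accepted by mask_ok."""
    masks = [m for m in product([False, True], repeat=LANES) if mask_ok(m)]
    return all(widening_adds_no_page(a, m) for a in addresses for m in masks)


every_address = range(2 * PAGE_BYTES)
aligned_only = range(0, 2 * PAGE_BYTES, LANES * LANE_BYTES)

print(holds_for_all(every_address, both_ends))  # both ends used, any address
print(holds_for_all(aligned_only, any))         # some lane used, aligned address
print(holds_for_all(every_address, any))        # some lane used, any address
```
</pre>

<p class="MsoNormal">Under these assumptions the three lines print True, True and False. The last case fails because an unaligned 16-byte vector can straddle a page boundary while the only used lane sits entirely on one side of it. The model covers only page granularity; conditions 2 to 4 still apply.</p>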

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>

<div style="border:none;border-left:solid blue 1.5pt;padding:0cm 0cm 0cm 4.0pt">

<div>

<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">

<p class="MsoNormal"><a name="_____replyseparator"></a><b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"> Sanjay Patel [mailto:spatel@rotateright.com]

<br>

<b>Sent:</b> Monday, April 04, 2016 20:11<br>

<b>To:</b> Zaks, Ayal <ayal.zaks@intel.com><br>

<b>Cc:</b> Nema, Ashutosh <Ashutosh.Nema@amd.com>; llvm-dev <llvm-dev@lists.llvm.org>; Demikhovsky, Elena <elena.demikhovsky@intel.com><br>

<b>Subject:</b> Re: [llvm-dev] masked-load endpoints optimization<o:p></o:p></span></p>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<div>

<p class="MsoNormal" style="margin-bottom:12.0pt"><br>

<br>

On Sun, Apr 3, 2016 at 3:51 PM, Zaks, Ayal <<a href="mailto:ayal.zaks@intel.com">ayal.zaks@intel.com</a>> wrote:<br>

><br>

> < The real question I have is whether it is legal to read the extra memory, regardless of whether this is a masked load or something else.<br>

><br>

>  <br>

><br>

> If one is allowed to read from a given address, a reasonable(?) assumption is that the aligned cache-line containing this address can be read. This should help answer the question.<br>

<br>

<br>

I started another thread with this question to also include cfe-dev:<br>

<a href="http://lists.llvm.org/pipermail/llvm-dev/2016-March/096828.html">http://lists.llvm.org/pipermail/llvm-dev/2016-March/096828.html</a><br>

<br>

For reference, the necessary conditions to do the transform are at least this:<br>

<br>

1. both ends of vector are used<br>

2. vector is smaller than granularity of cacheline and memory protection on targeted architecture<br>

3. not FP, or arch doesn’t raise flags on FP loads (most)<br>

4. not volatile or atomic<o:p></o:p></p>

</div>

<p class="MsoNormal">I have tried to meet all those requirements for x86, so the transform is still available in trunk. If I've missed a predicate, it should be considered a bug.<o:p></o:p></p>
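<p class="MsoNormal">The four conditions listed above can be read as a single legality predicate. The following Python sketch is a standalone model, not the x86 backend code: the MaskedLoad and Target records, their field names, and the example values are assumptions made for illustration. Since the list is described as "at least this", the predicate models necessary conditions only.</p>

<pre>
```python
from dataclasses import dataclass


@dataclass
class MaskedLoad:
    mask: tuple          # one bool per lane, known at compile time
    vector_bytes: int    # size of the full vector in bytes
    is_fp: bool          # element type is floating point
    is_volatile: bool
    is_atomic: bool


@dataclass
class Target:
    cacheline_bytes: int
    page_bytes: int              # memory-protection granularity
    fp_loads_raise_flags: bool   # do plain FP loads set exception flags?


def can_widen(load, target):
    """Necessary conditions 1-4 for replacing the masked load with a full load."""
    both_ends_used = load.mask[0] and load.mask[-1]                # condition 1
    granule = min(target.cacheline_bytes, target.page_bytes)
    small_enough = granule > load.vector_bytes                     # condition 2
    fp_ok = not load.is_fp or not target.fp_loads_raise_flags      # condition 3
    plain_access = not (load.is_volatile or load.is_atomic)        # condition 4
    return both_ends_used and small_enough and fp_ok and plain_access


# Illustrative values only: 64-byte lines, 4096-byte pages, no flags on FP loads.
target = Target(cacheline_bytes=64, page_bytes=4096, fp_loads_raise_flags=False)
endpoints = MaskedLoad((True, False, False, True), 16, False, False, False)
middle = MaskedLoad((False, True, True, False), 16, False, False, False)
print(can_widen(endpoints, target), can_widen(middle, target))
```
</pre>

<p class="MsoNormal">With these example values the endpoint mask passes and the middle-lanes mask does not, so the sketch prints True False.</p>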

<div>

<p class="MsoNormal"><br>

<br>

> Wonder in what situations one may know (at compile time) that both the first and the last bit of a mask are on?<br>

<br>

The main motivation was to make sure that all masked move operations were optimally supported by the x86 backend such that we could replace any regular AVX vector load/store with a masked op (including 'all' and 'none' masks) in source. This helps hand-coded,

 but possibly very templated, vector source code to perform as well as specialized vector code. In theory, it should also allow the auto-vectorizers to produce more efficient prologue/epilog loop code, but that has not been implemented yet AFAIK.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><br>

The "load doughnut" optimization was just something I noticed while handling the expected patterns, so I thought I'd better throw it in too. :)<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal" style="margin-bottom:12.0pt"><br>

<br>

><br>

>  <br>

><br>

> Thanks for rL263446,<br>

><br>

> Ayal.<br>

><br>

>  <br>

><br>

>  <br>

><br>

> From: llvm-dev [mailto:<a href="mailto:llvm-dev-bounces@lists.llvm.org">llvm-dev-bounces@lists.llvm.org</a>] On Behalf Of Sanjay Patel via llvm-dev<br>

> Sent: Friday, March 11, 2016 18:57<br>

> To: Nema, Ashutosh <<a href="mailto:Ashutosh.Nema@amd.com">Ashutosh.Nema@amd.com</a>><br>

> Cc: llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

> Subject: Re: [llvm-dev] masked-load endpoints optimization<br>

><br>

>  <br>

><br>

> Thanks, Ashutosh.<br>

><br>

> Yes, either TTI or TLI could be used to limit the transform if we do it in CGP rather than the DAG.<br>

><br>

> The real question I have is whether it is legal to read the extra memory, regardless of whether this is a masked load or something else.<br>

><br>

> Note that the x86 backend already does this, so either my proposal is ok for x86, or we're already doing an illegal optimization:<br>

><br>

><br>

> define <4 x i32> @load_bonus_bytes(i32* %addr1, <4 x i32> %v) {<br>

>   %ld1 = load i32, i32* %addr1<br>

>   %addr2 = getelementptr i32, i32* %addr1, i64 3<br>

>   %ld2 = load i32, i32* %addr2<br>

>   %vec1 = insertelement <4 x i32> undef, i32 %ld1, i32 0<br>

>   %vec2 = insertelement <4 x i32> %vec1, i32 %ld2, i32 3<br>

>   ret <4 x i32> %vec2<br>

> }<br>

><br>

> $ ./llc -o - loadcombine.ll<br>

> ...<br>

>     movups    (%rdi), %xmm0<br>

>     retq<br>

><br>

><br>

>  <br>

><br>

> On Thu, Mar 10, 2016 at 10:22 PM, Nema, Ashutosh <<a href="mailto:Ashutosh.Nema@amd.com">Ashutosh.Nema@amd.com</a>> wrote:<br>

><br>

> This looks interesting, the main motivation appears to be replacing masked vector load with a general vector load followed by a select.<br>

><br>

>  <br>

><br>

> Observed masked vector loads are in general expensive in comparison with a vector load.<br>

><br>

>  <br>

><br>

> But if first & last element of a masked vector load are guaranteed to be accessed then it can be transformed to a vector load.<br>

><br>

>  <br>

><br>

> In opt this can be driven by TTI, where the benefit of this transformation should be checked.<br>

><br>

>  <br>

><br>

> Regards,<br>

><br>

> Ashutosh<br>

><br>

>  <br>

><br>

> From: llvm-dev [mailto:<a href="mailto:llvm-dev-bounces@lists.llvm.org">llvm-dev-bounces@lists.llvm.org</a>] On Behalf Of Sanjay Patel via llvm-dev<br>

> Sent: Friday, March 11, 2016 3:37 AM<br>

> To: llvm-dev<br>

> Subject: [llvm-dev] masked-load endpoints optimization<br>

><br>

>  <br>

><br>

> If we're loading the first and last elements of a vector using a masked load [1], can we replace the masked load with a full vector load?<br>

><br>

> "The result of this operation is equivalent to a regular vector load instruction followed by a ‘select’ between the loaded and the passthru values, predicated on the same mask. However, using this intrinsic prevents exceptions on memory access to masked-off

 lanes."<br>

><br>

> I think the fact that we're loading the endpoints of the vector guarantees that a full vector load can't have any different faulting/exception behavior on x86 and most (?) other targets. We would, however, be reading memory that the program has not explicitly

 requested.<br>

><br>

> IR example:<br>

><br>

> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {<br>

><br>

>   ; load the first and last elements pointed to by %addr and shuffle those into %v<br>

><br>

>   %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4, <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %v)<br>

>   ret <4 x i32> %res<br>

> }<br>

><br>

> would become something like:<br>

><br>

><br>

> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {<br>

><br>

>   %vecload = load <4 x i32>, <4 x i32>* %addr, align 4<br>

><br>

>   %sel = select <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %vecload, <4 x i32> %v<br>

><br>

>   ret <4 x i32> %sel<br>

> }<br>

><br>

> If this isn't valid as an IR optimization, would it be acceptable as a DAG combine with target hook to opt in?<br>

><br>

><br>

> [1] <a href="http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics">http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics</a><br>

><br>

>  <br>

><br>

> ---------------------------------------------------------------------<br>

> Intel Israel (74) Limited<br>

><br>

> This e-mail and any attachments may contain confidential material for<br>

> the sole use of the intended recipient(s). Any review or distribution<br>

> by others is strictly prohibited. If you are not the intended<br>

> recipient, please contact the sender and delete all copies.<o:p></o:p></p>

</div>

</div>

</div>

</div>

</body>

</html>