<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--

/* Font Definitions */

@font-face

        {font-family:SimSun;

        panose-1:2 1 6 0 3 1 1 1 1 1;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

@font-face

        {font-family:"\@SimSun";

        panose-1:2 1 6 0 3 1 1 1 1 1;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

span.EmailStyle17

        {mso-style-type:personal-reply;

        font-family:"Calibri","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-family:"Calibri","sans-serif";}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Hi all,<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>ping…<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>I agree with Jiangning. I’m still confused about why the wide-store version is better than the narrow-stores version. If we think the wide store should eventually be optimized into narrow stores, why don’t we do that in SROA at the very beginning, when we are generating the SHL/ZEXT/OR?<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Thanks,<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>-Hao<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><div style='border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt'><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> llvm-commits-bounces@cs.uiuc.edu [mailto:llvm-commits-bounces@cs.uiuc.edu] <b>On Behalf Of </b>Jiangning Liu<br><b>Sent:</b> Wednesday, September 03, 2014 12:58 PM<br><b>To:</b> James Molloy<br><b>Cc:</b> Commit Messages and Patches for LLVM; 
reviews+D4954+public+39a520765005fb8e@reviews.llvm.org<br><b>Subject:</b> Re: [PATCH] [PATCH][SROA]Also slice the STORE when slicing a LOAD in AllocaSliceRewriter<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p>&nbsp;</o:p></p><div><p class=MsoNormal>Hi James,<o:p></o:p></p><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal style='margin-bottom:12.0pt'>Sorry, I'm still not convinced here.<o:p></o:p></p><div><p class=MsoNormal>2014-09-02 18:46 GMT+08:00 James Molloy &lt;<a href="mailto:james@jamesmolloy.co.uk" target="_blank">james@jamesmolloy.co.uk</a>&gt;:<o:p></o:p></p><div><div><p class=MsoNormal>Hi all,<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>I've just thrashed this out further with Chandler on IRC, and thought I'd summarize what came up here. Chandler will rebut if I make a mistake, I'm sure :)<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>The main point is that there are instcombines and a bunch of arithmetic-simplifying optimizations that rely on SROA producing wide stores. 
In fact, it isn't the width of the store that matters; it is that there is a merged SSA node.<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%a = i32 ...</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%b = i32 ...</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%c = shl i64 (zext i32 %a to i64), i32 32</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%d = or i64 %c, (zext i32 %b to i64)</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>store i64* %p, i64 %d</span><o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%e = load i64* %p</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>...</span><o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>In the above example, optimizers can follow the load of %p through the store and end up with the two constituent i32s, which they can then do magic with. 
In the alternative scenario:<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%a = i32 ...</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%b = i32 ...</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%q = bitcast i64* %p to i32*</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>store i32* (getelementptr i32* %q, i32 0), i32 %a</span><o:p></o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>store i32* (getelementptr i32* %a, i32 1), i32 %b</span><o:p></o:p></p></div></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>I think this is a typo; it should be %q rather than %a.<o:p></o:p></p></div><div><p class=MsoNormal>&nbsp;<o:p></o:p></p></div><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in'><div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal><span style='font-family:"Courier New"'>%e = load i64* %p</span><o:p></o:p></p></div><div><p class=MsoNormal>...<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>The optimizer can no longer easily see that %e is the concatenation of %a and %b.<o:p></o:p></p></div></div></blockquote><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><p class=MsoNormal>Why?<br><br>With this LLVM IR sequence, I think it is even easier to see that the value at %p is the combination of %a and %b than with the SHIFT+OR solution, because the two separate stores use two sequential getelementptr results, i.e. (i32* %q, i32 0) and (i32* %q, i32 1).<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>If InstCombine can cover the SHIFT+OR case, why can't it cover the separate stores as well? 
I think the latter one is even easier.<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><div><p class=MsoNormal>Thanks,<o:p></o:p></p></div><div><p class=MsoNormal>-Jiangning<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in'><div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>This is an important property, and is the main reason for not splitting wide stores to match their loads. Most importantly, this optimization happens in InstCombine and we run InstCombine after SLP and Loop Vectorization, which means the IR should be in this form at least up until the end of vectorization (or we lose this optimization after vectorization).<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>This then means that we need to teach the vectorizers and code metrics about the cost of splitting concatenated stores, which is likely to be awkward but we are left with little choice.<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>Cheers,<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>James<o:p></o:p></p></div></div><div><div><div><p class=MsoNormal style='margin-bottom:12.0pt'><o:p>&nbsp;</o:p></p><div><p class=MsoNormal>On 2 September 2014 11:20, Jiangning Liu &lt;<a href="mailto:liujiangning1@gmail.com" target="_blank">liujiangning1@gmail.com</a>&gt; wrote:<o:p></o:p></p><div><p class=MsoNormal>Hi Chandler,<o:p></o:p></p><div><p class=MsoNormal><o:p>&nbsp;</o:p></p><div><div><div><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in'><div><div><div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>Once you start slicing up memory accesses, *you break SSA form* and all of the analyses that depend on it. 
I cannot express how strongly I feel this is a very bad idea and the wrong direction in the middle end.<o:p></o:p></p></div><div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div></div></div></div></div></blockquote><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div></div><div><p class=MsoNormal>What do you mean by "you break SSA form" for the case Hao's patch is solving? Do you mean some SSA form information could be lost?<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>The transformation in Hao's patch changes a single wide store into two separate narrow stores, but the addresses of those two narrow stores are still sequential. For the memory stored here, we don't have any SSA information attached at all, right?<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>So what SSA form information could be lost? And what optimization could be affected? Can you give an example?<o:p></o:p></p></div><div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div><div><p class=MsoNormal>Thanks,<o:p></o:p></p></div><div><p class=MsoNormal>-Jiangning <o:p></o:p></p></div></div></div></div></div></div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div></div></div></blockquote></div><p class=MsoNormal><o:p>&nbsp;</o:p></p></div></div></div></div></body></html>