<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 23, 2014 at 3:07 PM, Owen Anderson <span dir="ltr"><<a href="mailto:resistor@mac.com" target="_blank">resistor@mac.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><div><blockquote type="cite"><div>On Dec 23, 2014, at 1:40 PM, Chandler Carruth <<a href="mailto:chandlerc@google.com" target="_blank">chandlerc@google.com</a>> wrote:</div><div><div dir="ltr" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>If we're going to talk about what the right long-term design is, let me put out a different opinion. I used to be somewhat torn on this issue, but this discussion and looking at the particular intrinsics in question, I'm rapidly being persuaded.</div><div><br></div><div>We shouldn't have any target specific intrinsics. At the very least, we shouldn't use them anywhere in the front- or middle-end, even if we have them.</div><div><br></div><div>Today, frontends need to emit specific target instrinsics *and* have the optimizer be aware of them. 
I can see a few reasons why:</div><div><br></div><div>1) Missing semantics -- the IR may not have *quite* the semantics desired and provided by the target's ISA.</div><div>2) Historical expectations -- the GCC-compatible builtins are named after the instructions, and the target independent builtins lower to intrinsics so the target-specific ones should too.</div><div>3) Poor instruction selection -- we could emit the logic as boring IR, but we fail to instruction select that well, so as a hack we emit the instruction directly and teach the optimizer to still optimize through it.</div><div><br></div><div>If we want to pursue the *right* design, I think we should be fixing these three issues and then we won't need the optimizer to be aware of any of this.</div></div></div></div></div></blockquote><br></div></span><div>I strongly disagree with your conclusions here.  Everything you’re suggesting is rooted in three base assumptions that are not true for many clients:</div><div><span style="white-space:pre-wrap">      </span>- that all source languages are basically C</div><div><span style="white-space:pre-wrap">      </span>- that all programming models are more or less like C on a *nix system</div><div><span style="white-space:pre-wrap">   </span>- that all hardware is basically like the intersection of X86 and ARM (“typical RISC machine”)</div></blockquote><div><br></div><div>As it happens, I don't hold these assumptions. I may be wrong in my suggested design, but that is most likely because I am simply wrong, not because I'm unconcerned with the use cases you describe below.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br></div><div>Consider the use case of an OpenGL shader compiler.  Its source language is not C (despite syntactic appearances) and the frontend may need to express semantics that are difficult or impossible to express in target-independent IR.  
Its programming model is not like a C compiler, including constructs like cross-thread derivatives, uniform vs varying calculations, etc.  Its target instruction set is likely nothing at all like X86 or ARM, likely including an arithmetic set that is very different from your typical CPU, as well as lots of ISA-level constructs for interacting with various fixed function hardware units.</div><div><br></div><div>Consider the less exotic use case of a DSP compiler.  DSPs typically have lots of instructions for “unusual” arithmetic operations that are intended to map to very specific use cases: lots of variants of rounding and/or wrapping control, lots of extending/widening/doubling operations, memory accesses with unusual stride patterns.  The entire purpose of the existence of a DSP is to deliver high computation bandwidth under tight latency constraints.  If your DSP compiler fails to make use of exotic arithmetic operations that the user requested, the whole system has *failed* at being a DSP.</div><div><br></div><div>Consider the even-closer-to-home use case of vector programming.  There are three major families of vector extensions in widespread use (SSE, NEON, and Altivec) as well as many variants and lesser-known instruction sets. And while all three agree on a small core of functionality (fadd <4 x float> !), all of them include large bodies of just plain arithmetic that are not covered by the others and are not practically expressible in target-independent IR.  Even if we add the union of their functionalities to target independent IR, then we have the reverse problem where the frontend and optimizers may produce IR that most backends have little to no hope of generating good code for.  
And let’s not forget that, while the requirements here are somewhat less strict than on a DSP, our users will still be very unhappy if they write a top-half-extending-saturating-absolute-difference builtin and we give them 100 instructions of emulated gunk back.</div></blockquote><div><br></div><div>FWIW, I don't really disagree with any of this....</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br></div><div>While I agree with the underlying sentiment that we should strive to minimize the intrusion of target-specific intrinsics as much as possible, and compartmentalizing them into their source backends as much as possible, expecting to reach a world with no intrinsic considerations in any part of the frontend or optimizer just seems hopelessly idealistic.</div></blockquote></div><br>I think maybe you are interpreting my suggestion as a more black and white thing than I was trying to propose....</div><div class="gmail_extra"><br></div><div class="gmail_extra">First off, I assume we will always have intrinsics that represent things that exist on some hardware but not all, and that perhaps aren't so pervasive as to merit instructions. We have many of these already, ranging from math library functions that sometimes have hardware implementations, like square root, to bit counting operations like ctpop. I'm not suggesting these would go away. I actually suspect there are a number of places where we should add more of these to handle edge cases that just aren't *that* uncommon in both source code and hardware. 
And here I'm including DSP, GPU, and every other form of source code I can think of....</div><div class="gmail_extra"><br></div><div class="gmail_extra">Second, I am assuming we will still need *some* way for frontends, especially some of the domain-focused ones you highlight, to communicate *very* precise operations to the backends, especially some of the domain-focused backends you highlight. I'm sorry if I downplayed this, but I assume that will always exist in some form.</div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">So, what I was trying to point out is that it isn't clear we need to have the ability to teach the middle end optimizer about the second set above. For example, the only place where I can find us dealing with intrinsics from Hexagon, r600/AMDGPU, or NVPTX in the middle end is for AMDGPU_rcp, which has an instcombine. While r600/AMDGPU doesn't really have a lot of intrinsics anyways, NVPTX seems to have many of the kinds of intrinsics that would be directly relevant to GPU shaders... But maybe there is something about how people are using NVPTX that makes this a bad example?</div><div class="gmail_extra"><br></div><div class="gmail_extra">The largest contributor of target-specific intrinsics in the middle end optimizer is actually x86, and I'm pretty confident that we can and probably should remove most of that code. The operations we optimize there don't actually seem special at all; I suspect this is more a consequence of historical needs than anything else. (I mean, I'm pretty sure I added some of those combines for x86!)</div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Anyways, maybe this doesn't actually work for other users of LLVM. If it doesn't, I would genuinely like to know why. 
Currently, I don't see where the problems are, but that's why we have mailing list discussions.</div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">And regardless, I stand by the claim that I don't think it is a small or reasonable amount of work (no matter which design!) if the goal is just to make LLVM's libraries less bloated for specific users. That seems like an important use case that we should be able to solve quickly and without major surgery of any kind....</div></div>