[LLVMdev] [RFC] Stripping unusable intrinsics

Chandler Carruth chandlerc at google.com
Tue Dec 23 15:52:56 PST 2014


On Tue, Dec 23, 2014 at 3:07 PM, Owen Anderson <resistor at mac.com> wrote:

> On Dec 23, 2014, at 1:40 PM, Chandler Carruth <chandlerc at google.com>
> wrote:
>
> If we're going to talk about what the right long-term design is, let me
> put out a different opinion. I used to be somewhat torn on this issue, but
> between this discussion and looking at the particular intrinsics in
> question, I'm rapidly being persuaded.
>
> We shouldn't have any target specific intrinsics. At the very least, we
> shouldn't use them anywhere in the front- or middle-end, even if we have
> them.
>
> Today, frontends need to emit specific target intrinsics *and* have the
> optimizer be aware of them. I can see a few reasons why:
>
> 1) Missing semantics -- the IR may not have *quite* the semantics desired
> and provided by the target's ISA.
> 2) Historical expectations -- the GCC-compatible builtins are named after
> the instructions, and the target-independent builtins lower to intrinsics,
> so the target-specific ones should too.
> 3) Poor instruction selection -- we could emit the logic as boring IR, but
> instruction selection does a poor job on it, so as a hack we emit the
> intrinsic directly and teach the optimizer to still optimize through it.
>
> If we want to pursue the *right* design, I think we should be fixing these
> three issues and then we won't need the optimizer to be aware of any of
> this.
>
>
> I strongly disagree with your conclusions here.  Everything you’re
> suggesting is rooted in three base assumptions that are not true for many
> clients:
> - that all source languages are basically C
> - that all programming models are more or less like C on a *nix system
> - that all hardware is basically like the intersection of X86 and ARM
> (“typical RISC machine”)
>

As it happens, I don't hold these assumptions. I may be wrong in my
suggested design, but that is most likely because I am simply wrong, not
because I'm unconcerned with the use cases you describe below.


>
> Consider the use case of an OpenGL shader compiler.  Its source language
> is not C (despite syntactic appearances) and the frontend may need to
> express semantics that are difficult or impossible to express in
> target-independent IR.  Its programming model is not like C's,
> including constructs like cross-thread derivatives, uniform vs. varying
> calculations, etc.  Its target instruction set is likely nothing at all
> like X86 or ARM, likely including an arithmetic set that is very different
> from your typical CPU's, as well as lots of ISA-level constructs for
> interacting with various fixed-function hardware units.
>
> Consider the less exotic use case of a DSP compiler.  DSPs typically have
> lots of instructions for “unusual” arithmetic operations that are intended
> to map to very specific use cases: lots of variants of rounding and/or
> wrapping control, lots of extending/widening/doubling operations, memory
> accesses with unusual stride patterns.  The entire purpose of the existence
> of a DSP is to deliver high computation bandwidth under tight latency
> constraints.  If your DSP compiler fails to make use of exotic arithmetic
> operations that the user requested, the whole system has *failed* at being
> a DSP.
>
> Consider the even-closer-to-home use case of vector programming.  There
> are three major families of vector extensions in widespread use (SSE, NEON,
> and Altivec) as well as many variants and lesser-known instruction sets.
> And while all three agree on a small core of functionality (fadd <4 x
> float> !), all of them include large bodies of just plain arithmetic that
> are not covered by the others and are not practically expressible in
> target-independent IR.  Even if we added the union of their functionality
> to target-independent IR, we would then have the reverse problem: the
> frontend and optimizers may produce IR that most backends have little to no
> hope of generating good code for.  And let’s not forget that, while the
> requirements here are somewhat less strict than on a DSP, our users will
> still be very unhappy if they write a
> top-half-extending-saturating-absolute-difference builtin and we give them
> 100 instructions of emulated gunk back.
>

FWIW, I don't really disagree with any of this....


> While I agree with the underlying sentiment that we should strive to
> minimize the intrusion of target-specific intrinsics and to compartmentalize
> them in their respective backends as much as possible, expecting to reach
> a world with no intrinsic considerations in any part of the frontend or
> optimizer just seems hopelessly idealistic.
>

I think maybe you are interpreting my suggestion as more black-and-white
than what I was trying to propose....

First off, I assume we will always have intrinsics that represent things
that exist on some hardware, not all, and perhaps aren't so pervasive as to
merit IR instructions. We have many of these already, ranging from math
library functions that sometimes have hardware implementations, like square
root, to bit-counting operations like ctpop. I'm not suggesting these would
go away. I actually suspect there are a number of places where we should
add more of these to handle edge cases that just aren't *that* uncommon in
both source code and hardware. And here I'm including DSP, GPU, and every
other form of source code I can think of....
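
For concreteness, these are the kinds of target-independent intrinsics I
mean (signatures as in the LangRef):

    declare i32    @llvm.ctpop.i32(i32)
    declare double @llvm.sqrt.f64(double)

    ; Every backend either selects a native instruction for these or
    ; lowers them to a libcall / expanded code sequence.
    %bits = call i32 @llvm.ctpop.i32(i32 %x)
    %root = call double @llvm.sqrt.f64(double %y)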

Second, I am assuming we will still need *some* way for frontends,
especially some of the domain-focused ones you highlight, to communicate
*very* precise operations to the backends, especially some of the
domain-focused backends you highlight. I'm sorry if I downplayed this, but
I assume that will always exist in some form.
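
As a sketch of what I mean here (the intrinsic name is hypothetical, not
something in the tree), a DSP frontend might emit:

    ; Hypothetical target intrinsic: opaque to the middle end beyond
    ; generic attributes like readnone, and selected only by its backend.
    declare i16 @llvm.mydsp.sat.absdiff.hi(i32, i32) readnone

    %r = call i16 @llvm.mydsp.sat.absdiff.hi(i32 %a, i32 %b)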


So, what I was trying to point out is that it isn't clear we need to have
the ability to teach the middle-end optimizer about the second set above.
For example, the only place where I can find us dealing with intrinsics
from Hexagon, r600/AMDGPU, or NVPTX in the middle end is AMDGPU's rcp
intrinsic, which has an instcombine. While r600/AMDGPU doesn't really have
a lot of intrinsics anyway, NVPTX seems to have many of the kinds of
intrinsics that would be directly relevant to GPU shaders... But maybe
there is something about how people are using NVPTX that makes this a bad
example?
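
If I'm reading that code right, the combine amounts to constant folding
the reciprocal when its operand is a constant; roughly:

    ; before
    %r = call float @llvm.AMDGPU.rcp.f32(float 2.0)

    ; after instcombine: the call is folded to a constant
    ; %r ==> float 5.000000e-01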

The largest body of target-specific intrinsic handling in the middle-end
optimizer is actually x86's, and I'm pretty confident that we can and
probably should remove most of that code. The operations we optimize there
don't actually seem special at all; I suspect this is more a consequence of
historical needs than anything else. (I mean, I'm pretty sure I added some
of those combines for x86!)
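
For instance (from memory, so take the exact details as illustrative): we
turn the x86 unaligned-store intrinsics into plain stores once the pointer
is known sufficiently aligned:

    ; before
    call void @llvm.x86.sse.storeu.ps(i8* %p, <4 x float> %v)

    ; after, when %p is known 16-byte aligned -- at which point there is
    ; nothing x86-specific left for the optimizer to care about:
    %ptr = bitcast i8* %p to <4 x float>*
    store <4 x float> %v, <4 x float>* %ptr, align 16

That's exactly the kind of combine that would work just as well if the
frontend had emitted boring IR in the first place.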


Anyways, maybe this doesn't actually work for other users of LLVM. If it
doesn't, I would genuinely like to know why. Currently, I don't see where
the problems are, but that's why we have mailing list discussions.


And regardless, I stand by the claim that this is not a small amount of
work (no matter which design!), and it isn't reasonable to require it if
the goal is just to make LLVM's libraries less bloated for specific users.
That seems like an important use case that we should be able to solve
quickly and without major surgery of any kind....