<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 29 August 2017 at 17:30, Tom Westerhout via cfe-dev <span dir="ltr"><<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 30/08/2017, Richard Smith <<a href="mailto:richard@metafoo.co.uk">richard@metafoo.co.uk</a>> wrote:<br>

> On 29 August 2017 at 16:51, Tom Westerhout via cfe-dev<br>

> <<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>> wrote:<br>

</span><span class="">>> Anyway, could you maybe point me to an example of user code<br>

>> specifying the materialisation process that I could play around with?<br>

><br>

> My observation was that such user code does not actually exist / work,<br>

> because the vector operations get folded together at the IR level. That<br>

> is: the objection to constant evaluation of vector operations in the<br>

> frontend does not appear to be a valid objection (perhaps it once was,<br>

> before the middle-end optimizers started optimizing vector operations,<br>

> but not any more).<br>

<br>

</span>OK, so essentially that's a go on trying to implement it, right? It'll<br>

probably take me some time before I come up with a PR as I'm completely<br>

unfamiliar with the code base.</blockquote><div><br></div><div>Yes, please go for it :)</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

>> > Example: on x86_64, v4si{-1, -1, -1, -1} + v4si{2, 0, 0, 0} can be<br>

>> > emitted as four instructions (pcmpeqd, mov, movd, paddd) totalling<br>

>> > 17 bytes, or as one movaps (7 bytes) plus a 16 byte immediate; the<br>

>> > former is both smaller and a little faster, but LLVM is only able to<br>

>> > produce the latter today.  LLVM is smart enough to produce good code<br>

>> > for those two constants in isolation, but not for v4si{1, -1, -1,<br>

>> > -1}.<br>

>><br>

>> I don't quite get it. Any chance you could provide a small piece of<br>

>> code illustrating your point?<br>

>><br>

><br>

> Sure:<br>

><br>

> v4si f() {<br>

>     return v4si{-1,-1,-1,-1} + v4si{2,0,0,0};<br>

> }<br>

><br>

> v4si g() {<br>

>   v4si result;<br>

>   asm(R"(pcmpeqd %0, %0<br>

>         movl $2, %%eax<br>

>         movd %%eax, %%xmm1<br>

>         paddd %%xmm1, %0)" : "=x"(result) : : "eax", "xmm1");<br>

>   return result;<br>

> }<br>

><br>

> LLVM will materialize v4si{-1,-1,-1,-1} as pcmpeqd, and it will<br>

> materialize {2,0,0,0} as movl + movd. But the code it produces for f()<br>

> is larger and slower than the code for g() (which is the naive<br>

> combination of what it did for the two constants in isolation), because<br>

> the vector operations got folded together.<br>

<br>

</span>Aha, thanks, I get it now. It's interesting though that f() gets<br>

implemented in a single movaps instruction: <a href="https://godbolt.org/g/azWbby" rel="noreferrer" target="_blank">https://godbolt.org/g/azWbby</a></blockquote></div></div></div>
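<div><br></div><div>The single movaps is the folding described above: the two vector constants are added at compile time, and the result is loaded as one 16-byte constant. A minimal sketch of that equivalence, assuming v4si is a GCC/Clang vector-extension typedef of four ints (the thread does not show its definition); folded() and same_lanes() are hypothetical helpers introduced only for this illustration:</div>

```cpp
// Assumed definition: four 32-bit ints in one 16-byte vector
// (GCC/Clang vector extension; not shown in the thread itself).
typedef int v4si __attribute__((vector_size(16)));

// Same expression as f() in the quoted example.
v4si f() {
    return v4si{-1, -1, -1, -1} + v4si{2, 0, 0, 0};
}

// Hypothetical helper: the constant the lane-wise addition folds to.
v4si folded() {
    return v4si{1, -1, -1, -1};
}

// Hypothetical helper: true when all four lanes of a and b are equal.
bool same_lanes(v4si a, v4si b) {
    for (int i = 0; i != 4; ++i) {
        if (a[i] != b[i]) return false;
    }
    return true;
}
```

<div>Here same_lanes(f(), folded()) holds, so once the addition is folded the backend only ever sees the constant {1, -1, -1, -1}, and it has no cheap materialization for that particular value.</div>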