<div dir="ltr"><div><font face="monospace, monospace">Hi David,</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Commuting the 1st<->2nd and 1st<->3rd operands is _usually_ prohibited </font></div><div><font face="monospace, monospace">for scalar FMA *_Int opcodes </font><span style="font-family:monospace,monospace">because it would change the values passed</span></div><div><span style="font-family:monospace,monospace">through from the first operand of the intrinsic.</span></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">I would challenge your statement:</font></div><div><font face="monospace, monospace">  "user cannot rely on knowing which operand is tied to the destination".</font></div><div><font face="monospace, monospace">It is common practice for all intrinsics with *_ss() and *_sd() suffixes that the first o</font><span style="font-family:monospace,monospace">perand of the intrinsic is tied to the destination.</span></div><div><font face="monospace, monospace">For example:</font></div><div><font face="monospace, monospace">    // <a href="https://software.intel.com/sites/default/files/a6/22/18072-347603.pdf">https://software.intel.com/sites/default/files/a6/22/18072-347603.pdf</a></font></div><div><font face="monospace, monospace">    __m128 _mm_add_ss(__m128 a, __m128 b)</font></div><div><font face="monospace, monospace">    Adds the lower single-precision, floating-point (SP FP) values of a and b; </font></div><div><font face="monospace, monospace"><span class="gmail-Apple-tab-span" style="white-space:pre"> </span>the upper 3<span style="white-space:pre"> </span>SP FP values are passed through from a.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">This point was probably not stated explicitly for the FMA intrinsics here:</font></div><div><font face="monospace, monospace"><span 
class="gmail-Apple-tab-span" style="white-space:pre">        </span><a href="https://software.intel.com/en-us/node/582845">https://software.intel.com/en-us/node/582845</a></font></div><div><font face="monospace, monospace">That is rather a documentation problem (actually, my fault, </font></div><div><font face="monospace, monospace">as I did not add a special note when I created/added those new _mm_fmadd_ss/sd() intrinsics).</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">The intention was to maintain the existing assumption about the 1st intrinsic operand, as usual, and </font><span style="font-family:monospace,monospace">to give users (including some math library guys) a tool with well-defined input/output behavior</span><span style="font-family:monospace,monospace">.</span></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">It is important to mention that the FMA form selection (132/213/231) by the compiler </font></div><div><font face="monospace, monospace">does not change the precision of the result. Switching between forms is always correct for vector opcodes and</font></div><div><font face="monospace, monospace">conditionally correct for *_Int opcodes. 
</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">*_Int opcodes may need some additional correctness </font><span style="font-family:monospace,monospace">analysis.</span></div><div><span style="font-family:monospace,monospace">Commuting the 2nd and 3rd operands is always correct, while commuting the 1st and 2nd or the 1st and 3rd</span></div><div><span style="font-family:monospace,monospace">requires use-def analysis.</span></div><div><span style="font-family:monospace,monospace">It</span><span style="font-family:monospace,monospace"> is OK to commute the 1st operand </span><font face="monospace, monospace">if it is known that the upper </font><span style="font-family:monospace,monospace">bits </span></div><div><span style="font-family:monospace,monospace">of the intrinsic result are not used.</span></div><div><span style="font-family:monospace,monospace">For example:</span></div><div><font face="monospace, monospace">  __m128 res = _mm_fmadd_ss(a, b, c);</font></div><div><font face="monospace, monospace">  _mm_store_ss(ptr, res); // this is the ONLY user of 'res'.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">I have not seen such a use-def analysis in LLVM, but it surely exists in some other compilers.</font></div><div><font face="monospace, monospace">Perhaps such an analysis will be implemented in LLVM eventually/soon.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Thank you,</font></div><div><font face="monospace, monospace">Vyacheslav Klochkov</font></div><div><font face="monospace, monospace"><br></font></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Sep 12, 2016 at 10:24 AM,  <span dir="ltr"><<a href="mailto:dag@cray.com" target="_blank">dag@cray.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I 
noticed that the operand commuting code in X86InstrInfo.cpp treats<br>

scalar FMA intrinsics specially.  It prevents operand commuting on these<br>

scalar instructions because the scalar FMA instructions preserve the<br>

upper bits of the vector.  Presumably, the restrictions are there<br>

because commuting operands potentially changes the result upper bits.<br>

<br>

However, AFAIK the Intel and GNU FMA intrinsics don't actually specify<br>

which FMA (213, 132, 231) is going to be used and so the user can't rely<br>

on knowing which operand is tied to the destination.  Thus the user<br>

can't rely on knowing what the upper bits will be.<br>

<br>

Is there some other reason these scalar FMA commuting restrictions are<br>

in place?<br>

<br>

Thanks!<br>

<br>

                            -David<br>

</blockquote></div><br></div>