<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 05/04/2019 16:26, Sander De Smalen
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:BB39E23A-CC39-4638-97E7-42EDC563E311@arm.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      Hi Simon,
      <div class=""><br class="">
      </div>
      <div class="">Thanks for your feedback! See my comments inline.
        <div class=""><br class="">
          <div>
            <blockquote type="cite" class="">
              <div class="">On 5 Apr 2019, at 09:47, Simon Pilgrim via
                llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org"
                  class="" moz-do-not-send="true">llvm-dev@lists.llvm.org</a>&gt;
                wrote:</div>
              <br class="Apple-interchange-newline">
              <div class="">
                <div class="moz-cite-prefix" style="caret-color: rgb(0,
                  0, 0); font-family: Helvetica; font-size: 12px;
                  font-style: normal; font-variant-caps: normal;
                  font-weight: normal; letter-spacing: normal;
                  text-align: start; text-indent: 0px; text-transform:
                  none; white-space: normal; word-spacing: 0px;
                  -webkit-text-stroke-width: 0px; background-color:
                  rgb(255, 255, 255); text-decoration: none;">
                  <br class="Apple-interchange-newline">
                  On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:<br
                    class="">
                </div>
                <blockquote type="cite"
                  cite="mid:d306cf98-1225-732d-8016-7e882b5136b1@redking.me.uk"
                  style="font-family: Helvetica; font-size: 12px;
                  font-style: normal; font-variant-caps: normal;
                  font-weight: normal; letter-spacing: normal; orphans:
                  auto; text-align: start; text-indent: 0px;
                  text-transform: none; white-space: normal; widows:
                  auto; word-spacing: 0px; -webkit-text-size-adjust:
                  auto; -webkit-text-stroke-width: 0px;
                  background-color: rgb(255, 255, 255); text-decoration:
                  none;" class="">
                  <div class="moz-cite-prefix">On 04/04/2019 14:11,
                    Sander De Smalen wrote:<br class="">
                  </div>
                  <blockquote type="cite"
                    cite="mid:67D8F282-9E37-473F-9973-AA981D992711@arm.com"
                    class="">
                    <div class="WordSection1" style="page:
                      WordSection1;"><span style="font-size: 11pt;"
                        class="">Proposed change:<o:p class=""></o:p></span>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">----------------------------<o:p
                            class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">In this
                          RFC I propose changing the intrinsics for
                          llvm.experimental.vector.reduce.fadd and
                          llvm.experimental.vector.reduce.fmul (see
                          options A and B). I also propose renaming the
                          'accumulator' operand to 'start value' because
                          for fmul this is the start value of the
                          reduction, rather than a value into which the
                          fmul reduction is accumulated.<o:p
                            class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""><o:p
                            class=""> </o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">[Option
                          A] Always using the start value operand in the
                          reduction (<a
                            href="https://reviews.llvm.org/D60261"
                            moz-do-not-send="true" style="color:
                            rgb(149, 79, 114); text-decoration:
                            underline;" class="">https://reviews.llvm.org/D60261</a>)<o:p
                            class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""><o:p
                            class=""> </o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""> 
                          declare float
                          @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float
                          %start_value, &lt;4 x float&gt; %vec)<o:p
                            class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""><o:p
                            class=""> </o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">This
                          means that if the start value is 'undef', the
                          result will be undef and all code creating
                          such a reduction will need to ensure it has a
                          sensible start value (e.g. 0.0 for fadd, 1.0
                          for fmul). When using 'fast' or ‘reassoc’ on
                          the call it will be implemented using an
                          unordered reduction, otherwise it will be
                          implemented with an ordered reduction. Note
                          that a new intrinsic is required to capture
                          the new semantics. In this proposal the
                          intrinsic is prefixed with a 'v2' for the time
                          being, with the expectation this will be
                          dropped when we remove 'experimental' from the
                          reduction intrinsics in the future.</span><span
                          style="font-size: 11pt; font-family: &quot;MS
                          Gothic&quot;;" class=""><o:p class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""><o:p
                            class=""> </o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">[Option
                          B] Having separate ordered and unordered
                          intrinsics (<a
                            href="https://reviews.llvm.org/D60262"
                            moz-do-not-send="true" style="color:
                            rgb(149, 79, 114); text-decoration:
                            underline;" class="">https://reviews.llvm.org/D60262</a>).<o:p
                            class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""><o:p
                            class=""> </o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""> 
                          declare float
                          @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float
                          %start_value, &lt;4 x float&gt; %vec)<o:p
                            class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""> 
                          declare float
                          @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(&lt;4
                          x float&gt; %vec)<o:p class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""><o:p
                            class=""> </o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">This
                          will mean that the behaviour is explicit from
                          the intrinsic and the use of 'fast' or
                          ‘reassoc’ on the call has no effect on how
                          that intrinsic is lowered. The ordered
                          reduction intrinsic will take a scalar
                          start-value operand, whereas the unordered
                          reduction intrinsic will only take a vector
                          operand.<o:p class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class=""><o:p
                            class=""> </o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">Both
                          options auto-upgrade the IR to use the new
                          (version of the) intrinsics. I'm personally
                          slightly in favour of [Option B], because it
                          better aligns with the definition of the
                          SelectionDAG nodes and is more explicit in its
                          semantics. We also avoid having to use an
                          artificial 'v2' like prefix to denote the new
                          behaviour of the intrinsic.<o:p class=""></o:p></span></div>
                      <span style="font-size: 11pt;" class=""><o:p
                          class=""></o:p></span></div>
                  </blockquote>
                  <p class="">Do we have any targets with instructions
                    that can actually use the start value? TBH I'd be
                    tempted to suggest we just make the initial
                    extractelement/fadd/insertelement pattern a manual
                    extra stage and avoid having that argument
                    entirely.<span class="Apple-converted-space"> </span><br
                      class="">
                  </p>
                </blockquote>
              </div>
            </blockquote>
            <div style="margin: 0px; font-stretch: normal; line-height:
              normal; font-family: &quot;Helvetica Neue&quot;;" class="">
              ARM SVE has the FADDA instruction for strict fadd
              reductions (see for example test/MC/AArch64/SVE/fadda.s).
              This instruction takes an explicit start-value operand.
              The reduction intrinsics were originally introduced for
              SVE where we modelled the fadd/fmul reductions with this
              instruction in mind.</div>
            <div style="margin: 0px; font-stretch: normal; line-height:
              normal; font-family: &quot;Helvetica Neue&quot;;" class="">
              <br class="">
            </div>
            <div style="margin: 0px; font-stretch: normal; line-height:
              normal;" class=""><font class="" face="Helvetica Neue">Just
                to clarify, is this what you are suggesting regarding
                extract/fadd/insert?<br class="">
                <br class="">
                &nbsp; %first = extractelement &lt;4 x float&gt; %input, i32
                0<br class="">
                &nbsp; %first.new = fadd float %start, %first<br class="">
                &nbsp; %input.new = insertelement &lt;4 x float&gt; %input,
                float %first.new, i32 0<br class="">
                &nbsp; %red = call float
                @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(&lt;4
                x float&gt; %input.new)<br class="">
                <br class="">
                My only reservation here is that LLVM might obfuscate
                this code so that CodeGen couldn't easily match the
                extract/fadd/insert pattern, thus adding the extra fadd
                instruction. This could happen, for example, if the loop
                were rotated/pipelined to load the next iteration and do
                the first 'fadd' before the next iteration. </font><span
                style="font-family: &quot;Helvetica Neue&quot;;"
                class="">In such a case, having the extra operand would be
                more descriptive.</span></div>
          </div>
        </div>
      </div>
    </blockquote>
    <p>Yes, that was the IR I had in mind, but you're right that it's
      probably useful for chained fadd reductions as well as the
      SVE-specific instruction. If we're getting rid of the fast-math
      'undef' special case and we expect an 'identity' start value (fadd
      = 0.0f, fmul = 1.0f) that we can optimize away, then I've no
      objections.</p>
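    <p>To make the trade-off concrete, here is a minimal sketch in Python
      (not LLVM IR) of the two forms discussed above. It assumes Python
      doubles rather than f32, and all helper names are hypothetical. It
      only illustrates that folding the start value into lane 0 computes
      the same thing as an ordered reduction with a start-value operand,
      and that a reassociated (tree) reduction may give a different
      result.</p>
    <pre>
```python
from functools import reduce
import operator


def ordered_fadd(start, vec):
    """Strict in-order reduction: (((start + v0) + v1) + ...)."""
    return reduce(operator.add, vec, start)


def fold_start_into_lane0(start, vec):
    """Models the extractelement/fadd/insertelement pattern from the thread."""
    return [start + vec[0]] + list(vec[1:])


def ordered_fadd_no_start(vec):
    """Hypothetical ordered reduction that takes only a vector operand."""
    return reduce(operator.add, vec)


def pairwise_fadd(vec):
    """One possible reassociated order: a tree reduction.

    Assumes the lane count is a power of two.
    """
    vals = list(vec)
    while len(vals) != 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]


# Lanes chosen so that rounding makes the evaluation order visible.
vec = [1e16, 1.0, -1e16, 1.0]

in_order = ordered_fadd(0.0, vec)
via_lane0 = ordered_fadd_no_start(fold_start_into_lane0(0.0, vec))
reassociated = pairwise_fadd(vec)
```
    </pre>
    <p>Here <code>in_order</code> and <code>via_lane0</code> agree by
      construction, while <code>reassociated</code> differs for these
      inputs because the small lanes are absorbed by rounding.</p>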
    <blockquote type="cite"
      cite="mid:BB39E23A-CC39-4638-97E7-42EDC563E311@arm.com">
      <div class="">
        <div class="">
          <div>
            <blockquote type="cite" class="">
              <div class="">
                <blockquote type="cite"
                  cite="mid:d306cf98-1225-732d-8016-7e882b5136b1@redking.me.uk"
                  style="font-family: Helvetica; font-size: 12px;
                  font-style: normal; font-variant-caps: normal;
                  font-weight: normal; letter-spacing: normal; orphans:
                  auto; text-align: start; text-indent: 0px;
                  text-transform: none; white-space: normal; widows:
                  auto; word-spacing: 0px; -webkit-text-size-adjust:
                  auto; -webkit-text-stroke-width: 0px;
                  background-color: rgb(255, 255, 255); text-decoration:
                  none;" class="">
                  <blockquote type="cite"
                    cite="mid:67D8F282-9E37-473F-9973-AA981D992711@arm.com"
                    class="">
                    <div class="WordSection1" style="page:
                      WordSection1;">
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">Further
                          efforts:<o:p class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">----------------------------<o:p
                            class=""></o:p></span></div>
                      <div style="margin: 0cm 0cm 0.0001pt; font-size:
                        12pt; font-family: Calibri, sans-serif;"
                        class="">
                        <span style="font-size: 11pt;" class="">Here is a
                          non-exhaustive list of items I think work
                          towards making the intrinsics
                          non-experimental:</span><span
                          style="font-size: 11pt; font-family: &quot;MS
                          Gothic&quot;;" class="" lang="EN-US">
</span><span
                          style="font-size: 11pt;" class=""><o:p
                            class=""></o:p></span></div>
                      <ul style="margin-bottom: 0cm; margin-top: 0cm;"
                        class="" type="disc">
                        <li class="MsoListParagraph" style="margin: 0cm
                          0cm 0.0001pt -18pt; font-size: 12pt;
                          font-family: Calibri, sans-serif;">
                          <span style="font-size: 11pt;" class="">Adding
                            SelectionDAG legalization for the  _STRICT
                            reduction SDNodes. After some great work
                            from Nikita in D58015, unordered reductions
                            are now legalized/expanded in SelectionDAG,
                            so if we add expansion in SelectionDAG for
                            strict reductions this would make the
                            ExpandReductionsPass redundant.<o:p class=""></o:p></span></li>
                        <li class="MsoListParagraph" style="margin: 0cm
                          0cm 0.0001pt -18pt; font-size: 12pt;
                          font-family: Calibri, sans-serif;">
                          <span style="font-size: 11pt;" class="">Better
                            enforcing the constraints of the intrinsics
                            (see<span class="Apple-converted-space"> </span><a
                              href="https://reviews.llvm.org/D60260"
                              moz-do-not-send="true" style="color:
                              rgb(149, 79, 114); text-decoration:
                              underline;" class="">https://reviews.llvm.org/D60260</a><span
                              class="Apple-converted-space"> </span>).</span><span
                            style="font-size: 11pt;" class=""
                            lang="EN-US">
</span><span style="font-size:
                            11pt;" class=""><o:p class=""></o:p></span></li>
                        <li class="MsoListParagraph" style="margin: 0cm
                          0cm 0.0001pt -18pt; font-size: 12pt;
                          font-family: Calibri, sans-serif;">
                          <span style="font-size: 11pt;" class="">I
                            think we'll also want to be able to overload
                            the result operand based on the vector
                            element type for the intrinsics having the
                            constraint that the result type must match
                            the vector element type. e.g. dropping the
                            redundant 'i32' in:</span><span
                            style="font-size: 11pt;" class=""><br
                              class="">
                             <span class="Apple-converted-space"> </span></span><span
                            style="font-size: 11pt;" class="">i32
                            @llvm.experimental.vector.reduce.and.i32.v4i32(&lt;4
                            x i32&gt; %a) =&gt; i32
                            @llvm.experimental.vector.reduce.and.v4i32(&lt;4
                            x i32&gt; %a)<o:p class=""></o:p></span></li>
                      </ul>
                      <div style="margin: 0cm 0cm 0.0001pt 18pt;
                        font-size: 12pt; font-family: Calibri,
                        sans-serif;" class="">
                        <span style="font-size: 11pt;" class="">since
                          i32 is implied by &lt;4 x i32&gt;. This would
                          have the added benefit that LLVM would
                          automatically check that the operands match.</span><span
                          style="font-size: 11pt; font-family: &quot;MS
                          Gothic&quot;;" class="" lang="EN-US">
</span></div>
                    </div>
                  </blockquote>
                  <p class="">Won't this cause issues with overflow?
                    Isn't the point of an add (or mul...) reduction of,
                    say, &lt;64 x i8&gt; to give a larger (i32 or i64)
                    result so we don't lose anything? I agree that for bitop
                    reductions it doesn't make sense, though.<br class="">
                  </p>
                </blockquote>
                <span style="caret-color: rgb(0, 0, 0); font-family:
                  Helvetica; font-size: 12px; font-style: normal;
                  font-variant-caps: normal; font-weight: normal;
                  letter-spacing: normal; text-align: start;
                  text-indent: 0px; text-transform: none; white-space:
                  normal; word-spacing: 0px; -webkit-text-stroke-width:
                  0px; background-color: rgb(255, 255, 255);
                  text-decoration: none; float: none; display: inline
                  !important;" class="">Sorry - I forgot to add: which
                  raises the question - should we be considering
                  signed/unsigned add/mul and possibly saturation
                  reductions?</span><br style="caret-color: rgb(0, 0,
                  0); font-family: Helvetica; font-size: 12px;
                  font-style: normal; font-variant-caps: normal;
                  font-weight: normal; letter-spacing: normal;
                  text-align: start; text-indent: 0px; text-transform:
                  none; white-space: normal; word-spacing: 0px;
                  -webkit-text-stroke-width: 0px; background-color:
                  rgb(255, 255, 255); text-decoration: none;" class="">
              </div>
            </blockquote>
            <div style="margin: 0px; font-stretch: normal; line-height:
              normal; font-family: &quot;Helvetica Neue&quot;;" class="">
              The current intrinsics explicitly specify that:</div>
            <div style="margin: 0px; font-stretch: normal; line-height:
              normal; font-family: &quot;Helvetica Neue&quot;;" class="">
                 "The return type matches the element-type of the vector
              input"</div>
            <div style="margin: 0px; font-stretch: normal; line-height:
              normal; font-family: &quot;Helvetica Neue&quot;;" class="">
              <br class="">
            </div>
            <div style="margin: 0px; font-stretch: normal; line-height:
              normal; font-family: &quot;Helvetica Neue&quot;;" class="">
              This was done to avoid having explicit signed/unsigned add
              reductions, reasoning that zero- and sign-extension can be
              done on the input values to the reduction. We had a bit of
              debate on this internally, and it would come down to
              similar reasons as for the extra 'start value' operand to
              fadd reductions. I think we'd welcome the signed/unsigned
              variants as they would be more descriptive and would
              safeguard the code from transformations that make it
              difficult to fold the sign/zero extend into the operation
              during CodeGen. The downside however is that for
              signed/unsigned add reductions it would mean that both
              operations are the same when the result type equals the
              element type.</div>
          </div>
        </div>
      </div>
    </blockquote>
    <p>An alternative would be that we limit the existing add/mul cases
      to the same result type (along with
      and/or/xor/smax/smin/umax/umin) and we add sadd/uadd/smul/umul
      extending reductions as well.</p>
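    <p>As a rough illustration of why the extending variants would need
      to be signed/unsigned while the same-width ones would not, here is
      a small Python sketch. Lanes are modelled as 8-bit patterns
      (0..255); the function names are hypothetical and do not
      correspond to existing intrinsics.</p>
    <pre>
```python
def sext_i8(bits):
    """Interpret an 8-bit pattern (0..255) as a signed value (-128..127)."""
    return bits - 256 * (bits // 128)


def reduce_add_i8(lanes):
    """Same-width add reduction: the result wraps modulo 2**8."""
    return sum(lanes) % 2**8


def reduce_uadd_i32(lanes):
    """Hypothetical zero-extending add reduction with an i32 result."""
    return sum(lanes) % 2**32


def reduce_sadd_i32(lanes):
    """Hypothetical sign-extending add reduction; i32 result as a bit pattern."""
    return sum(sext_i8(b) for b in lanes) % 2**32


# 64 lanes of i8, every lane holding the bit pattern 0xFF
# (255 when read as unsigned, -1 when read as signed).
lanes = [0xFF] * 64

same_width = reduce_add_i8(lanes)
same_width_signed = sum(sext_i8(b) for b in lanes) % 2**8
widened_unsigned = reduce_uadd_i32(lanes)
widened_signed = reduce_sadd_i32(lanes)
```
    </pre>
    <p>With the result type equal to the element type, the signed and
      unsigned readings produce the same bits, so one add reduction
      suffices; once the result is widened, the two readings diverge.</p>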
    <blockquote type="cite"
      cite="mid:BB39E23A-CC39-4638-97E7-42EDC563E311@arm.com">
      <div class="">
        <div class="">
          <div>
            <div style="margin: 0px; font-stretch: normal; line-height:
              normal; font-family: &quot;Helvetica Neue&quot;;" class="">
              <div style="margin: 0px; font-stretch: normal;
                line-height: normal;" class="">Saturating vector
                reductions sound sensible, but are there any targets
                that implement these at the moment?</div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    X86/SSE has the v8i16 HADDS/HSUBS horizontal signed saturation
    instructions, and X86/XOP has extend+horizontal-add/sub instructions
    (<a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/XOP_instruction_set">https://en.wikipedia.org/wiki/XOP_instruction_set</a>).<br>
  </body>
</html>