<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 05/04/2019 16:26, Sander De Smalen

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:BB39E23A-CC39-4638-97E7-42EDC563E311@arm.com">


      Hi Simon,

      <div class=""><br class="">

      </div>

      <div class="">Thanks for your feedback! See my comments inline.

        <div class=""><br class="">

          <div>

            <blockquote type="cite" class="">

              <div class="">On 5 Apr 2019, at 09:47, Simon Pilgrim via

                llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org"

                  class="" moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>

                wrote:</div>

              <br class="Apple-interchange-newline">

              <div class="">

                <div class="moz-cite-prefix" style="caret-color: rgb(0,

                  0, 0); font-family: Helvetica; font-size: 12px;

                  font-style: normal; font-variant-caps: normal;

                  font-weight: normal; letter-spacing: normal;

                  text-align: start; text-indent: 0px; text-transform:

                  none; white-space: normal; word-spacing: 0px;

                  -webkit-text-stroke-width: 0px; background-color:

                  rgb(255, 255, 255); text-decoration: none;">

                  <br class="Apple-interchange-newline">

                  On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:<br

                    class="">

                </div>

                <blockquote type="cite"

                  cite="mid:d306cf98-1225-732d-8016-7e882b5136b1@redking.me.uk"

                  style="font-family: Helvetica; font-size: 12px;

                  font-style: normal; font-variant-caps: normal;

                  font-weight: normal; letter-spacing: normal; orphans:

                  auto; text-align: start; text-indent: 0px;

                  text-transform: none; white-space: normal; widows:

                  auto; word-spacing: 0px; -webkit-text-size-adjust:

                  auto; -webkit-text-stroke-width: 0px;

                  background-color: rgb(255, 255, 255); text-decoration:

                  none;" class="">

                  <div class="moz-cite-prefix">On 04/04/2019 14:11,

                    Sander De Smalen wrote:<br class="">

                  </div>

                  <blockquote type="cite"

                    cite="mid:67D8F282-9E37-473F-9973-AA981D992711@arm.com"

                    class="">

                    <div class="WordSection1" style="page:

                      WordSection1;"><span style="font-size: 11pt;"

                        class="">Proposed change:<o:p class=""></o:p></span>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">----------------------------<o:p

                            class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">In this

                          RFC I propose changing the intrinsics for

                          llvm.experimental.vector.reduce.fadd and

                          llvm.experimental.vector.reduce.fmul (see

                          options A and B). I also propose renaming the

                          'accumulator' operand to 'start value' because

                          for fmul this is the start value of the

                          reduction, rather than a value into which the

                          fmul reduction is accumulated.<o:p

                            class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""><o:p

                            class=""> </o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">[Option

                          A] Always using the start value operand in the

                          reduction (<a

                            href="https://reviews.llvm.org/D60261"

                            moz-do-not-send="true" style="color:

                            rgb(149, 79, 114); text-decoration:

                            underline;" class="">https://reviews.llvm.org/D60261</a>)<o:p

                            class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""><o:p

                            class=""> </o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""> 

                          declare float

                          @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float

                          %start_value, <4 x float> %vec)<o:p

                            class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""><o:p

                            class=""> </o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">This

                          means that if the start value is 'undef', the

                          result will be undef and all code creating

                          such a reduction will need to ensure it has a

                          sensible start value (e.g. 0.0 for fadd, 1.0

                          for fmul). When using 'fast' or ‘reassoc’ on

                          the call it will be implemented using an

                          unordered reduction, otherwise it will be

                          implemented with an ordered reduction. Note

                          that a new intrinsic is required to capture

                          the new semantics. In this proposal the

                          intrinsic is prefixed with a 'v2' for the time

                          being, with the expectation this will be

                          dropped when we remove 'experimental' from the

                          reduction intrinsics in the future.</span><span

                          style="font-size: 11pt; font-family: 'MS Gothic';" class=""><o:p class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""><o:p

                            class=""> </o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">[Option

                          B] Having separate ordered and unordered

                          intrinsics (<a

                            href="https://reviews.llvm.org/D60262"

                            moz-do-not-send="true" style="color:

                            rgb(149, 79, 114); text-decoration:

                            underline;" class="">https://reviews.llvm.org/D60262</a>).<o:p

                            class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""><o:p

                            class=""> </o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""> 

                          declare float

                          @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float

                          %start_value, <4 x float> %vec)<o:p

                            class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""> 

                          declare float

                          @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4

                          x float> %vec)<o:p class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""><o:p

                            class=""> </o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">This

                          will mean that the behaviour is explicit from

                          the intrinsic and the use of 'fast' or

                          ‘reassoc’ on the call has no effect on how

                          that intrinsic is lowered. The ordered

                          reduction intrinsic will take a scalar

                          start-value operand, whereas the unordered

                          reduction intrinsic will only take a vector

                          operand.<o:p class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class=""><o:p

                            class=""> </o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">Both

                          options auto-upgrade the IR to use the new

                          (version of the) intrinsics. I'm personally

                          slightly in favour of [Option B], because it

                          better aligns with the definition of the

                          SelectionDAG nodes and is more explicit in its

                          semantics. We also avoid having to use an

                          artificial 'v2'-like prefix to denote the new

                          behaviour of the intrinsic.<o:p class=""></o:p></span></div>

                      <span style="font-size: 11pt;" class=""><o:p

                          class=""></o:p></span></div>

                  </blockquote>

                  <p class="">Do we have any targets with instructions

                    that can actually use the start value? TBH I'd be

                    tempted to suggest we just make the initial

                    extractelement/fadd/insertelement pattern a manual

                    extra stage and avoid having that argument

                    entirely.<span class="Apple-converted-space"> </span><br

                      class="">

                  </p>

                </blockquote>

              </div>

            </blockquote>

            <div style="margin: 0px; font-stretch: normal; line-height:

              normal; font-family: 'Helvetica Neue';" class="">

              ARM SVE has the FADDA instruction for strict fadd

              reductions (see for example test/MC/AArch64/SVE/fadda.s).

              This instruction takes an explicit start-value operand.

              The reduction intrinsics were originally introduced for

              SVE where we modelled the fadd/fmul reductions with this

              instruction in mind.</div>

            <div style="margin: 0px; font-stretch: normal; line-height:

              normal; font-family: 'Helvetica Neue';" class="">

              <br class="">

            </div>

            <div style="margin: 0px; font-stretch: normal; line-height:

              normal;" class=""><font class="" face="Helvetica Neue">Just

                to clarify, is this what you are suggesting regarding

                extract/fadd/insert?<br class="">

                <br class="">

                  %first = extractelement <4 x float> %input, i32

                0<br class="">

                  %first.new = fadd float %start, %first<br class="">

                  %input.new = insertelement <4 x float> %input,

                float %first.new, i32 0<br class="">

                  %red = call float

                @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(<4

                x float> %input.new)<br class="">

                <br class="">

                My only reservation here is that LLVM might transform

                this code in a way that obscures the extract/fadd/insert

                pattern from CodeGen, leaving an extra fadd instruction

                in the output. This could happen, for example, if the loop

                were rotated/pipelined so that it loads the next iteration's

                data and does the first 'fadd' before that iteration begins. </font><span

                style="font-family: 'Helvetica Neue';"

                class="">In such a case, having the extra operand would be

                more descriptive.</span></div>

          </div>

        </div>

      </div>

    </blockquote>

    <p>Yes, that was the IR I had in mind, but you're right that it's

      probably useful for chained fadd reductions as well as the

      SVE-specific instruction. If we're getting rid of the fast-math

      'undef' special case and we expect an 'identity' start value (fadd

      = 0.0f, fmul = 1.0f) that we can optimize away, then I've no

      objections.</p>
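    <p>To make the ordered/unordered distinction and the role of the start
      value concrete, here is a minimal Python sketch that models
      single-precision fadd reductions. The helper names (f32, fadd,
      reduce_fadd_ordered, reduce_fadd_unordered, ordered_no_start) are
      made up for illustration and are not LLVM APIs, and the pairwise tree
      is only one of the evaluation orders an unordered reduction might
      use.</p>

<pre>
```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fadd(a, b):
    """Model a single-precision fadd: add, then round the result to f32."""
    return f32(f32(a) + f32(b))


def reduce_fadd_ordered(start, vec):
    """Strict in-order reduction: (((start + v0) + v1) + v2) + ..."""
    acc = f32(start)
    for x in vec:
        acc = fadd(acc, x)
    return acc


def reduce_fadd_unordered(vec):
    """One possible unordered evaluation: a pairwise tree.

    Assumes a power-of-two lane count, as in the 4-lane examples here.
    """
    vals = [f32(x) for x in vec]
    while len(vals) > 1:
        vals = [fadd(vals[i], vals[i + 1]) for i in range(0, len(vals), 2)]
    return vals[0]


def ordered_no_start(vec):
    """Ordered reduction without a start operand: ((v0 + v1) + v2) + ..."""
    return reduce_fadd_ordered(vec[0], vec[1:])


if __name__ == "__main__":
    vec = [1e8, 1.0, -1e8, 1.0]
    print("ordered  :", reduce_fadd_ordered(0.0, vec))
    print("unordered:", reduce_fadd_unordered(vec))

    # Folding the start value into lane 0 (the extract/fadd/insert pattern)
    # performs the same sequence of additions as passing it as an operand.
    start, lanes = 2.5, [1.0, 2.0, 3.0, 4.0]
    folded = [fadd(start, lanes[0])] + lanes[1:]
    print("operand  :", reduce_fadd_ordered(start, lanes))
    print("folded   :", ordered_no_start(folded))
```
</pre>

    <p>In this model the two forms of the ordered reduction agree exactly,
      because they perform the same additions in the same order, while the
      ordered and tree results differ for the first input. One subtlety
      under IEEE-754: 0.0 + (-0.0) gives +0.0, so a start value of -0.0 is
      the one that leaves every possible input unchanged.</p>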

    <blockquote type="cite"

      cite="mid:BB39E23A-CC39-4638-97E7-42EDC563E311@arm.com">

      <div class="">

        <div class="">

          <div>

            <blockquote type="cite" class="">

              <div class="">

                <blockquote type="cite"

                  cite="mid:d306cf98-1225-732d-8016-7e882b5136b1@redking.me.uk"

                  style="font-family: Helvetica; font-size: 12px;

                  font-style: normal; font-variant-caps: normal;

                  font-weight: normal; letter-spacing: normal; orphans:

                  auto; text-align: start; text-indent: 0px;

                  text-transform: none; white-space: normal; widows:

                  auto; word-spacing: 0px; -webkit-text-size-adjust:

                  auto; -webkit-text-stroke-width: 0px;

                  background-color: rgb(255, 255, 255); text-decoration:

                  none;" class="">

                  <blockquote type="cite"

                    cite="mid:67D8F282-9E37-473F-9973-AA981D992711@arm.com"

                    class="">

                    <div class="WordSection1" style="page:

                      WordSection1;">

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">Further

                          efforts:<o:p class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">----------------------------<o:p

                            class=""></o:p></span></div>

                      <div style="margin: 0cm 0cm 0.0001pt; font-size:

                        12pt; font-family: Calibri, sans-serif;"

                        class="">

                        <span style="font-size: 11pt;" class="">Here is a

                          non-exhaustive list of items that I think work

                          towards making the intrinsics

                          non-experimental:</span><span

                          style="font-size: 11pt; font-family: 'MS Gothic';" class="" lang="EN-US"> </span><span

                          style="font-size: 11pt;" class=""><o:p

                            class=""></o:p></span></div>

                      <ul style="margin-bottom: 0cm; margin-top: 0cm;"

                        class="" type="disc">

                        <li class="MsoListParagraph" style="margin: 0cm

                          0cm 0.0001pt -18pt; font-size: 12pt;

                          font-family: Calibri, sans-serif;">

                          <span style="font-size: 11pt;" class="">Adding

                            SelectionDAG legalization for the _STRICT

                            reduction SDNodes. After some great work

                            from Nikita in D58015, unordered reductions

                            are now legalized/expanded in SelectionDAG,

                            so if we add expansion in SelectionDAG for

                            strict reductions this would make the

                            ExpandReductionsPass redundant.<o:p class=""></o:p></span></li>

                        <li class="MsoListParagraph" style="margin: 0cm

                          0cm 0.0001pt -18pt; font-size: 12pt;

                          font-family: Calibri, sans-serif;">

                          <span style="font-size: 11pt;" class="">Better

                            enforcing the constraints of the intrinsics

                            (see<span class="Apple-converted-space"> </span><a

                              href="https://reviews.llvm.org/D60260"

                              moz-do-not-send="true" style="color:

                              rgb(149, 79, 114); text-decoration:

                              underline;" class="">https://reviews.llvm.org/D60260</a><span

                              class="Apple-converted-space"> </span>).</span><span

                            style="font-size: 11pt;" class=""

                            lang="EN-US"> </span><span style="font-size:

                            11pt;" class=""><o:p class=""></o:p></span></li>

                        <li class="MsoListParagraph" style="margin: 0cm

                          0cm 0.0001pt -18pt; font-size: 12pt;

                          font-family: Calibri, sans-serif;">

                          <span style="font-size: 11pt;" class="">I

                            think we'll also want to be able to overload

                            the result operand based on the vector

                            element type for the intrinsics having the

                            constraint that the result type must match

                            the vector element type. e.g. dropping the

                            redundant 'i32' in:</span><span

                            style="font-size: 11pt;" class=""><br

                              class="">

                             <span class="Apple-converted-space"> </span></span><span

                            style="font-size: 11pt;" class="">i32

                            @llvm.experimental.vector.reduce.and.i32.v4i32(<4

                            x i32> %a) => i32

                            @llvm.experimental.vector.reduce.and.v4i32(<4

                            x i32> %a)<o:p class=""></o:p></span></li>

                      </ul>

                      <div style="margin: 0cm 0cm 0.0001pt 18pt;

                        font-size: 12pt; font-family: Calibri,

                        sans-serif;" class="">

                        <span style="font-size: 11pt;" class="">since

                          i32 is implied by <4 x i32>. This would

                          have the added benefit that LLVM would

                          automatically check that the operands match.</span><span

                          style="font-size: 11pt; font-family: 'MS Gothic';" class="" lang="EN-US"> </span></div>

                    </div>

                  </blockquote>

                  <p class="">Won't this cause issues with overflow?

                    Isn't the point of an add (or mul...) reduction of

                    say, <64 x i8> giving a larger (i32 or i64)

                    result so we don't lose anything? I agree for bitop

                    reductions it doesn't make sense though.<br class="">

                  </p>

                </blockquote>

                <span style="caret-color: rgb(0, 0, 0); font-family:

                  Helvetica; font-size: 12px; font-style: normal;

                  font-variant-caps: normal; font-weight: normal;

                  letter-spacing: normal; text-align: start;

                  text-indent: 0px; text-transform: none; white-space:

                  normal; word-spacing: 0px; -webkit-text-stroke-width:

                  0px; background-color: rgb(255, 255, 255);

                  text-decoration: none; float: none; display: inline

                  !important;" class="">Sorry - I forgot to add: which

                  raises the question - should we be considering

                  signed/unsigned add/mul and possibly saturation

                  reductions?</span><br style="caret-color: rgb(0, 0,

                  0); font-family: Helvetica; font-size: 12px;

                  font-style: normal; font-variant-caps: normal;

                  font-weight: normal; letter-spacing: normal;

                  text-align: start; text-indent: 0px; text-transform:

                  none; white-space: normal; word-spacing: 0px;

                  -webkit-text-stroke-width: 0px; background-color:

                  rgb(255, 255, 255); text-decoration: none;" class="">

              </div>

            </blockquote>

            <div style="margin: 0px; font-stretch: normal; line-height:

              normal; font-family: 'Helvetica Neue';" class="">

              The current intrinsics explicitly specify that:</div>

            <div style="margin: 0px; font-stretch: normal; line-height:

              normal; font-family: 'Helvetica Neue';" class="">

                 "The return type matches the element-type of the vector

              input"</div>

            <div style="margin: 0px; font-stretch: normal; line-height:

              normal; font-family: 'Helvetica Neue';" class="">

              <br class="">

            </div>

            <div style="margin: 0px; font-stretch: normal; line-height:

              normal; font-family: 'Helvetica Neue';" class="">

              This was done to avoid having explicit signed/unsigned add

              reductions, reasoning that zero- and sign-extension can be

              done on the input values to the reduction. We had a bit of

              debate on this internally, and it would come down to

              similar reasons as for the extra 'start value' operand to

              fadd reductions. I think we'd welcome the signed/unsigned

              variants as they would be more descriptive and would

              safeguard the code from transformations that make it

              difficult to fold the sign/zero extend into the operation

              during CodeGen. The downside however is that for

              signed/unsigned add reductions it would mean that both

              operations are the same when the result type equals the

              element type.</div>

          </div>

        </div>

      </div>

    </blockquote>

    <p>An alternative would be that we limit the existing add/mul cases

      to the same result type (along with

      and/or/xor/smax/smin/umax/umin) and we add sadd/uadd/smul/umul

      extending reductions as well.</p>
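    <p>As a rough illustration of the difference, the following Python
      sketch models a same-type (wrapping) add reduction next to
      hypothetical extending signed/unsigned variants for a 64 x i8 input.
      The function names are invented for this example and do not
      correspond to existing intrinsics.</p>

<pre>
```python
def wrap(value, bits):
    """Keep the low `bits` bits of value (two's-complement wraparound)."""
    return value % (2 ** bits)


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    value = wrap(value, bits)
    return value - 2 ** bits if value >= 2 ** (bits - 1) else value


def reduce_add_same_type(vec, bits):
    """Add reduction whose result type equals the element type."""
    acc = 0
    for x in vec:
        acc = wrap(acc + x, bits)
    return acc


def reduce_uadd_extending(vec, elem_bits, result_bits):
    """Zero-extend each lane to the result width, then add."""
    return wrap(sum(wrap(x, elem_bits) for x in vec), result_bits)


def reduce_sadd_extending(vec, elem_bits, result_bits):
    """Sign-extend each lane to the result width, then add."""
    total = sum(to_signed(x, elem_bits) for x in vec)
    return to_signed(total, result_bits)


if __name__ == "__main__":
    lanes = [0xFF] * 64  # 64 x i8, every lane has all bits set
    print("i8 result       :", reduce_add_same_type(lanes, 8))
    print("uadd i8 to i32  :", reduce_uadd_extending(lanes, 8, 32))
    print("sadd i8 to i32  :", reduce_sadd_extending(lanes, 8, 32))
```
</pre>

    <p>The extending results differ between the signed and unsigned
      variants, but truncating either one back to 8 bits gives the same
      bit pattern as the same-type reduction, which is why the two
      variants collapse into one operation when the result type equals
      the element type.</p>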

    <blockquote type="cite"

      cite="mid:BB39E23A-CC39-4638-97E7-42EDC563E311@arm.com">

      <div class="">

        <div class="">

          <div>

            <div style="margin: 0px; font-stretch: normal; line-height:

              normal; font-family: 'Helvetica Neue';" class="">

              <div style="margin: 0px; font-stretch: normal;

                line-height: normal;" class="">Saturating vector

                reductions sound sensible, but are there any targets

                that implement these at the moment?</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    X86/SSE has the v8i16 HADDS/HSUBS (PHADDSW/PHSUBSW, from SSSE3)

    horizontal signed saturation instructions, and X86/XOP has

    extend+horizontal-add/sub instructions

    (<a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/XOP_instruction_set">https://en.wikipedia.org/wiki/XOP_instruction_set</a>).<br>
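    <p>For reference, a small Python sketch of what a saturating
      reduction built from pairwise signed-saturating adds could look
      like. It only models the pairwise step on a single vector, not the
      exact two-operand layout of the hardware instructions, and all names
      are hypothetical. It also suggests that the evaluation order would
      matter for saturating reductions, much as it does for fadd.</p>

<pre>
```python
I16_MIN, I16_MAX = -32768, 32767


def sat_add_i16(a, b):
    """Signed 16-bit add that clamps to the i16 range instead of wrapping."""
    return max(I16_MIN, min(I16_MAX, a + b))


def horizontal_sat_add(vec):
    """Add adjacent lane pairs with signed saturation (halves the lane count)."""
    return [sat_add_i16(vec[i], vec[i + 1]) for i in range(0, len(vec), 2)]


def reduce_sat_add_tree(vec):
    """Reduce by repeated pairwise steps; assumes a power-of-two lane count."""
    vals = list(vec)
    while len(vals) > 1:
        vals = horizontal_sat_add(vals)
    return vals[0]


def reduce_sat_add_ordered(vec):
    """Reduce strictly left to right, saturating after every add."""
    acc = 0
    for x in vec:
        acc = sat_add_i16(acc, x)
    return acc


if __name__ == "__main__":
    lanes = [32767, 0, 1, -1, 0, 0, 0, 0]  # v8i16
    print("pairwise step:", horizontal_sat_add(lanes))
    print("tree         :", reduce_sat_add_tree(lanes))
    print("ordered      :", reduce_sat_add_ordered(lanes))
```
</pre>

    <p>With this input the in-order result saturates at the third lane
      and then drops by one, while the tree result cancels the 1 and -1
      first, so the two orders disagree.</p>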

  </body>

</html>