<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>If we do load combining at the IR level, one thing we'll need to

      give some thought to is atomicity.  Combining two atomic loads

      into a wider (legal) atomic load is not a reversible

      transformation given our current specification.</p>

    <p>I've been thinking about a concept I've been tentatively calling

      "element wise atomicity" which would make this a reversible

      transform by representing the load as either a vector or struct

      load on which the atomicity requirement was clearly expressed on

      the individual elements, not the vector/struct itself.  (We have

      no way to spell that today.)</p>

    <p>Not having load combining be a reversible transformation is

      really unfortunate.  Consider this small example:</p>

    <p>store i32 0, a+4<br>

      load i32, a<br>

      load i32, a+4</p>

    <p>If you happen to visit the load pair first, you really want to be

      able to split them again once you discover the store.  <br>

    </p>

    <p>Note that the same theoretical problem exists in

      MI/SelectionDAG.  We just do so much less memory optimization

      there we get away with mostly not caring.  :)</p>

    <p>Philip<br>

    </p>

    <div class="moz-cite-prefix">On 9/24/2019 4:50 PM, Artur Pilipenko

      via llvm-dev wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:6ED8F220-12AD-4AE0-BF55-18D3D845515C@azul.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      Implementing load combine/widening on the IR level sounds like a

      reasonable enhancement to me. Although, we (Azul) don't have any

      plans of doing that in the near future.

      <div class=""><br class="">

      </div>

      <div class="">

        <div class="">Artur<br class="">

          <div><br class="">

            <blockquote type="cite" class="">

              <div class="">On 12 Sep 2019, at 00:58, Paweł Bylica <<a

                  href="mailto:chfast@gmail.com" class=""

                  moz-do-not-send="true">chfast@gmail.com</a>> wrote:</div>

              <br class="Apple-interchange-newline">

              <div class="">

                <div dir="ltr" class="">

                  <div class="">Ok, thanks.</div>

                  <div class=""><br class="">

                  </div>

                  <div class="">Are there any plans to reintroduce it on

                    the IR level? I'm not confident this is strictly

                    necessary, but in some cases not having load

                    widening ends up really bad.</div>

                  <div class="">Like in the case where vectorizer tries

                    to do something about it:</div>

                  <div class=""><a href="https://godbolt.org/z/60RuEw"

                      class="" moz-do-not-send="true">https://godbolt.org/z/60RuEw</a></div>

                  <div class=""><a

                      href="https://bugs.llvm.org/show_bug.cgi?id=42708"

                      class="" moz-do-not-send="true">https://bugs.llvm.org/show_bug.cgi?id=42708</a></div>

                  <div class=""><br class="">

                  </div>

                  <div class="">At the current state I'm forced to use

                    memset() to express uint64 load from an array of

                    bytes.</div>

                  <div class=""><br class="">

                  </div>

                  <div class="">// P.<br class="">

                  </div>

                </div>

                <br class="">

                <div class="gmail_quote">

                  <div dir="ltr" class="gmail_attr">On Thu, Sep 12, 2019

                    at 4:22 AM Artur Pilipenko <<a

                      href="mailto:apilipenko@azul.com" class=""

                      moz-do-not-send="true">apilipenko@azul.com</a>>

                    wrote:<br class="">

                  </div>

                  <blockquote class="gmail_quote" style="margin:0px 0px

                    0px 0.8ex;border-left:1px solid

                    rgb(204,204,204);padding-left:1ex">

                    <div style="overflow-wrap: break-word;" class="">Hi,

                      <div class=""><br class="">

                      </div>

                      <div class="">Load widening/combine for the

                        original pattern was implemented in DAG

                        combiner. See:</div>

                      <div class=""><a

href="https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6872"

                          target="_blank" class=""

                          moz-do-not-send="true">https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6872</a></div>

                      <div class=""><br class="">

                      </div>

                      <div class="">Later a similar transform was

                        introduced fro stores:</div>

                      <div class=""><a

href="https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6700"

                          target="_blank" class=""

                          moz-do-not-send="true">https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6700</a></div>

                      <div class=""><br class="">

                      </div>

                      <div class="">There is no load widening on the IR

                        level as of now.</div>

                      <div class=""><br class="">

                      </div>

                      <div class="">Artur<br class="">

                        <div class=""><br class="">

                          <blockquote type="cite" class="">

                            <div class="">On 11 Sep 2019, at 04:47,

                              Paweł Bylica <<a

                                href="mailto:chfast@gmail.com"

                                target="_blank" class=""

                                moz-do-not-send="true">chfast@gmail.com</a>>

                              wrote:</div>

                            <br

                              class="gmail-m_8191889820642647725Apple-interchange-newline">

                            <div class="">

                              <div dir="ltr" class="">

                                <div class="">Hi,</div>

                                <div class=""><br class="">

                                </div>

                                <div class="">Can I ask what is the

                                  status of load widening. It seems

                                  there is no load widening on IR at

                                  all.</div>

                                <div class=""><br class="">

                                </div>

                                <div class="">// Paweł<br class="">

                                </div>

                              </div>

                              <br class="">

                              <div class="gmail_quote">

                                <div dir="ltr" class="gmail_attr">On

                                  Wed, Oct 5, 2016 at 1:49 PM Artur

                                  Pilipenko via llvm-dev <<a

                                    href="mailto:llvm-dev@lists.llvm.org"

                                    target="_blank" class=""

                                    moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>

                                  wrote:<br class="">

                                </div>

                                <blockquote class="gmail_quote"

                                  style="margin:0px 0px 0px

                                  0.8ex;border-left:1px solid

                                  rgb(204,204,204);padding-left:1ex">

                                  Philip and I talked about this is

                                  person. Given the fact that load

                                  widening in presence of atomics is

                                  irreversible transformation we agreed

                                  that we don't want to do this early.

                                  For now it can be implemented as a

                                  peephole optimization over machine IR.

                                  MIR is preferred here because X86

                                  backend does GEP reassociation at MIR

                                  level and it can make information

                                  about addresses being adjacent

                                  available.<br class="">

                                  <br class="">

                                  Inline cost of the original pattern is

                                  a valid concern and we might want to

                                  revisit our decision later. But in

                                  order to do widening earlier we need

                                  to have a way to undo this

                                  transformation.<br class="">

                                  <br class="">

                                  I’m going to try implementing MIR

                                  optimization but not in the immediate

                                  future.<br class="">

                                  <br class="">

                                  Artur<br class="">

                                  <br class="">

                                  > On 28 Sep 2016, at 19:32, Artur

                                  Pilipenko <<a

                                    href="mailto:apilipenko@azulsystems.com"

                                    target="_blank" class=""

                                    moz-do-not-send="true">apilipenko@azulsystems.com</a>>

                                  wrote:<br class="">

                                  > <br class="">

                                  > One of the arguments for doing

                                  this earlier is inline cost perception

                                  of the original pattern. Reading

                                  i32/i64 by bytes look much more

                                  expensive than it is and can prevent

                                  inlining of interesting function.<br

                                    class="">

                                  > <br class="">

                                  > Inhibiting other optimizations

                                  concern can be addressed by careful

                                  selection of the pattern we’d like to

                                  match. I limit the transformation to

                                  the case when all the individual have

                                  no uses other than forming a wider

                                  load. In this case it’s less likely to

                                  loose information during this

                                  transformation.<br class="">

                                  > <br class="">

                                  > I didn’t think of atomicity

                                  aspect though.<br class="">

                                  > <br class="">

                                  > Artur<br class="">

                                  > <br class="">

                                  >> On 28 Sep 2016, at 18:50,

                                  Philip Reames <<a

                                    href="mailto:listmail@philipreames.com"

                                    target="_blank" class=""

                                    moz-do-not-send="true">listmail@philipreames.com</a>>

                                  wrote:<br class="">

                                  >> <br class="">

                                  >> There's a bit of additional

                                  context worth adding here...<br

                                    class="">

                                  >> <br class="">

                                  >> Up until very recently, we

                                  had a form of widening implemented in

                                  GVN.  We decided to remove this in

                                  <a

                                    href="https://reviews.llvm.org/D24096"

                                    rel="noreferrer" target="_blank"

                                    class="" moz-do-not-send="true">

                                    https://reviews.llvm.org/D24096</a>

                                  precisely because its placement in the

                                  pass pipeline was inhibiting other

                                  optimizations. There's also a major

                                  problem with doing widening at the IR

                                  level which is that widening a pair of

                                  atomic loads into a single wider

                                  atomic load can not be undone. This

                                  creates a major pass ordering problem

                                  of its own.<br class="">

                                  >> <br class="">

                                  >> At this point, my general

                                  view is that widening transformations

                                  of any kind should be done very late. 

                                  Ideally, this is something the backend

                                  would do, but doing it as a CGP like

                                  fixup pass over the IR is also

                                  reasonable.<br class="">

                                  >> <br class="">

                                  >> With that in mind, I feel

                                  both the current placement of

                                  LoadCombine (within the inliner

                                  iteration) and the proposed

                                  InstCombine rule are undesirable.<br

                                    class="">

                                  >> <br class="">

                                  >> Philip<br class="">

                                  >> <br class="">

                                  >> <br class="">

                                  >> On 09/28/2016 08:22 AM, Artur

                                  Pilipenko wrote:<br class="">

                                  >>> Hi,<br class="">

                                  >>> <br class="">

                                  >>> I'm trying to optimize a

                                  pattern like this into a single i16

                                  load:<br class="">

                                  >>> %1 = bitcast i16* %pData

                                  to i8*<br class="">

                                  >>> %2 = load i8, i8* %1,

                                  align 1<br class="">

                                  >>> %3 = zext i8 %2 to i16<br

                                    class="">

                                  >>> %4 = shl nuw i16 %3, 8<br

                                    class="">

                                  >>> %5 = getelementptr

                                  inbounds i8, i8* %1, i16 1<br class="">

                                  >>> %6 = load i8, i8* %5,

                                  align 1<br class="">

                                  >>> %7 = zext i8 %6 to i16<br

                                    class="">

                                  >>> %8 = shl nuw nsw i16 %7,

                                  0<br class="">

                                  >>> %9 = or i16 %8, %4<br

                                    class="">

                                  >>> <br class="">

                                  >>> I came across load

                                  combine pass which is motivated by

                                  virtualliy the same pattern. Load

                                  combine optimizes the pattern by

                                  combining adjacent loads into one load

                                  and lets the rest of the optimizer

                                  cleanup the rest. From what I see on

                                  the initial review for load combine (<a

href="https://reviews.llvm.org/D3580" rel="noreferrer" target="_blank"

                                    class="" moz-do-not-send="true">https://reviews.llvm.org/D3580</a>)

                                  it was not enabled by default because

                                  it caused some performance

                                  regressions. It's not very surprising,

                                  I see how this type of widening can

                                  obscure some facts for the rest of the

                                  optimizer.<br class="">

                                  >>> <br class="">

                                  >>> I can't find any

                                  backstory for this pass, why was it

                                  chosen to optimize the pattern in

                                  question in this way? What is the

                                  current status of this pass?<br

                                    class="">

                                  >>> <br class="">

                                  >>> I have an alternative

                                  implementation for it locally. I

                                  implemented an instcombine rule

                                  similar to recognise bswap/bitreverse

                                  idiom. It relies on collectBitParts

                                  (Local.cpp) to determine the origin of

                                  the bits in a given or value. If all

                                  the bits are happen to be loaded from

                                  adjacent locations it replaces the or

                                  with a single load or a load plus

                                  bswap.<br class="">

                                  >>> <br class="">

                                  >>> If the alternative

                                  approach sounds reasonable I'll post

                                  my patches for review.<br class="">

                                  >>> <br class="">

                                  >>> Artur<br class="">

                                  >> <br class="">

                                  > <br class="">

                                  <br class="">

_______________________________________________<br class="">

                                  LLVM Developers mailing list<br

                                    class="">

                                  <a

                                    href="mailto:llvm-dev@lists.llvm.org"

                                    target="_blank" class=""

                                    moz-do-not-send="true">llvm-dev@lists.llvm.org</a><br

                                    class="">

                                  <a

                                    href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev"

                                    rel="noreferrer" target="_blank"

                                    class="" moz-do-not-send="true">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br

                                    class="">

                                </blockquote>

                              </div>

                            </div>

                          </blockquote>

                        </div>

                        <br class="">

                      </div>

                    </div>

                  </blockquote>

                </div>

              </div>

            </blockquote>

          </div>

          <br class="">

        </div>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <pre class="moz-quote-pre" wrap="">_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

    </blockquote>

  </body>

</html>