<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>If we do load combining at the IR level, one thing we'll need to
      give some thought to is atomicity.  Combining two atomic loads
      into a wider (legal) atomic load is not a reversible
      transformation given our current specification.</p>
    <p>I've been thinking about a concept I've been tentatively calling
      "element wise atomicity" which would make this a reversible
      transform by representing the load as either a vector or struct
      load on which the atomicity requirement was clearly expressed on
      the individual elements, not the vector/struct itself.  (We have
      no way to spell that today.)</p>
    <p>Not having load combining be a reversible transformation is
      really unfortunate.  Consider this small example:</p>
    <p>store i32 0, a+4<br>
      load i32, a<br>
      load i32, a+4</p>
    <p>If you happen to visit the load pair first, you really want to be
      able to split them again once you discover the store.  <br>
    </p>
    <p>Note that the same theoretical problem exists in
      MI/SelectionDAG.  We just do so much less memory optimization
      there we get away with mostly not caring.  :)</p>
    <p>Philip<br>
    </p>
    <div class="moz-cite-prefix">On 9/24/2019 4:50 PM, Artur Pilipenko
      via llvm-dev wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:6ED8F220-12AD-4AE0-BF55-18D3D845515C@azul.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      Implementing load combine/widening on the IR level sounds like a
      reasonable enhancement to me. Although, we (Azul) don't have any
      plans of doing that in the near future.
      <div class=""><br class="">
      </div>
      <div class="">
        <div class="">Artur<br class="">
          <div><br class="">
            <blockquote type="cite" class="">
              <div class="">On 12 Sep 2019, at 00:58, Paweł Bylica <<a
                  href="mailto:chfast@gmail.com" class=""
                  moz-do-not-send="true">chfast@gmail.com</a>> wrote:</div>
              <br class="Apple-interchange-newline">
              <div class="">
                <div dir="ltr" class="">
                  <div class="">Ok, thanks.</div>
                  <div class=""><br class="">
                  </div>
                  <div class="">Are there any plans to reintroduce it on
                    the IR level? I'm not confident this is strictly
                    necessary, but in some cases not having load
                    widening ends up really bad.</div>
                  <div class="">Like in the case where vectorizer tries
                    to do something about it:</div>
                  <div class=""><a href="https://godbolt.org/z/60RuEw"
                      class="" moz-do-not-send="true">https://godbolt.org/z/60RuEw</a></div>
                  <div class=""><a
                      href="https://bugs.llvm.org/show_bug.cgi?id=42708"
                      class="" moz-do-not-send="true">https://bugs.llvm.org/show_bug.cgi?id=42708</a></div>
                  <div class=""><br class="">
                  </div>
                  <div class="">At the current state I'm forced to use
                    memset() to express uint64 load from an array of
                    bytes.</div>
                  <div class=""><br class="">
                  </div>
                  <div class="">// P.<br class="">
                  </div>
                </div>
                <br class="">
                <div class="gmail_quote">
                  <div dir="ltr" class="gmail_attr">On Thu, Sep 12, 2019
                    at 4:22 AM Artur Pilipenko <<a
                      href="mailto:apilipenko@azul.com" class=""
                      moz-do-not-send="true">apilipenko@azul.com</a>>
                    wrote:<br class="">
                  </div>
                  <blockquote class="gmail_quote" style="margin:0px 0px
                    0px 0.8ex;border-left:1px solid
                    rgb(204,204,204);padding-left:1ex">
                    <div style="overflow-wrap: break-word;" class="">Hi,
                      <div class=""><br class="">
                      </div>
                      <div class="">Load widening/combine for the
                        original pattern was implemented in DAG
                        combiner. See:</div>
                      <div class=""><a
href="https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6872"
                          target="_blank" class=""
                          moz-do-not-send="true">https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6872</a></div>
                      <div class=""><br class="">
                      </div>
                      <div class="">Later a similar transform was
                        introduced fro stores:</div>
                      <div class=""><a
href="https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6700"
                          target="_blank" class=""
                          moz-do-not-send="true">https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6700</a></div>
                      <div class=""><br class="">
                      </div>
                      <div class="">There is no load widening on the IR
                        level as of now.</div>
                      <div class=""><br class="">
                      </div>
                      <div class="">Artur<br class="">
                        <div class=""><br class="">
                          <blockquote type="cite" class="">
                            <div class="">On 11 Sep 2019, at 04:47,
                              Paweł Bylica <<a
                                href="mailto:chfast@gmail.com"
                                target="_blank" class=""
                                moz-do-not-send="true">chfast@gmail.com</a>>
                              wrote:</div>
                            <br
                              class="gmail-m_8191889820642647725Apple-interchange-newline">
                            <div class="">
                              <div dir="ltr" class="">
                                <div class="">Hi,</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">Can I ask what is the
                                  status of load widening. It seems
                                  there is no load widening on IR at
                                  all.</div>
                                <div class=""><br class="">
                                </div>
                                <div class="">// Paweł<br class="">
                                </div>
                              </div>
                              <br class="">
                              <div class="gmail_quote">
                                <div dir="ltr" class="gmail_attr">On
                                  Wed, Oct 5, 2016 at 1:49 PM Artur
                                  Pilipenko via llvm-dev <<a
                                    href="mailto:llvm-dev@lists.llvm.org"
                                    target="_blank" class=""
                                    moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
                                  wrote:<br class="">
                                </div>
                                <blockquote class="gmail_quote"
                                  style="margin:0px 0px 0px
                                  0.8ex;border-left:1px solid
                                  rgb(204,204,204);padding-left:1ex">
                                  Philip and I talked about this is
                                  person. Given the fact that load
                                  widening in presence of atomics is
                                  irreversible transformation we agreed
                                  that we don't want to do this early.
                                  For now it can be implemented as a
                                  peephole optimization over machine IR.
                                  MIR is preferred here because X86
                                  backend does GEP reassociation at MIR
                                  level and it can make information
                                  about addresses being adjacent
                                  available.<br class="">
                                  <br class="">
                                  Inline cost of the original pattern is
                                  a valid concern and we might want to
                                  revisit our decision later. But in
                                  order to do widening earlier we need
                                  to have a way to undo this
                                  transformation.<br class="">
                                  <br class="">
                                  I’m going to try implementing MIR
                                  optimization but not in the immediate
                                  future.<br class="">
                                  <br class="">
                                  Artur<br class="">
                                  <br class="">
                                  > On 28 Sep 2016, at 19:32, Artur
                                  Pilipenko <<a
                                    href="mailto:apilipenko@azulsystems.com"
                                    target="_blank" class=""
                                    moz-do-not-send="true">apilipenko@azulsystems.com</a>>
                                  wrote:<br class="">
                                  > <br class="">
                                  > One of the arguments for doing
                                  this earlier is inline cost perception
                                  of the original pattern. Reading
                                  i32/i64 by bytes look much more
                                  expensive than it is and can prevent
                                  inlining of interesting function.<br
                                    class="">
                                  > <br class="">
                                  > Inhibiting other optimizations
                                  concern can be addressed by careful
                                  selection of the pattern we’d like to
                                  match. I limit the transformation to
                                  the case when all the individual have
                                  no uses other than forming a wider
                                  load. In this case it’s less likely to
                                  loose information during this
                                  transformation.<br class="">
                                  > <br class="">
                                  > I didn’t think of atomicity
                                  aspect though.<br class="">
                                  > <br class="">
                                  > Artur<br class="">
                                  > <br class="">
                                  >> On 28 Sep 2016, at 18:50,
                                  Philip Reames <<a
                                    href="mailto:listmail@philipreames.com"
                                    target="_blank" class=""
                                    moz-do-not-send="true">listmail@philipreames.com</a>>
                                  wrote:<br class="">
                                  >> <br class="">
                                  >> There's a bit of additional
                                  context worth adding here...<br
                                    class="">
                                  >> <br class="">
                                  >> Up until very recently, we
                                  had a form of widening implemented in
                                  GVN.  We decided to remove this in
                                  <a
                                    href="https://reviews.llvm.org/D24096"
                                    rel="noreferrer" target="_blank"
                                    class="" moz-do-not-send="true">
                                    https://reviews.llvm.org/D24096</a>
                                  precisely because its placement in the
                                  pass pipeline was inhibiting other
                                  optimizations. There's also a major
                                  problem with doing widening at the IR
                                  level which is that widening a pair of
                                  atomic loads into a single wider
                                  atomic load can not be undone. This
                                  creates a major pass ordering problem
                                  of its own.<br class="">
                                  >> <br class="">
                                  >> At this point, my general
                                  view is that widening transformations
                                  of any kind should be done very late. 
                                  Ideally, this is something the backend
                                  would do, but doing it as a CGP like
                                  fixup pass over the IR is also
                                  reasonable.<br class="">
                                  >> <br class="">
                                  >> With that in mind, I feel
                                  both the current placement of
                                  LoadCombine (within the inliner
                                  iteration) and the proposed
                                  InstCombine rule are undesirable.<br
                                    class="">
                                  >> <br class="">
                                  >> Philip<br class="">
                                  >> <br class="">
                                  >> <br class="">
                                  >> On 09/28/2016 08:22 AM, Artur
                                  Pilipenko wrote:<br class="">
                                  >>> Hi,<br class="">
                                  >>> <br class="">
                                  >>> I'm trying to optimize a
                                  pattern like this into a single i16
                                  load:<br class="">
                                  >>> %1 = bitcast i16* %pData
                                  to i8*<br class="">
                                  >>> %2 = load i8, i8* %1,
                                  align 1<br class="">
                                  >>> %3 = zext i8 %2 to i16<br
                                    class="">
                                  >>> %4 = shl nuw i16 %3, 8<br
                                    class="">
                                  >>> %5 = getelementptr
                                  inbounds i8, i8* %1, i16 1<br class="">
                                  >>> %6 = load i8, i8* %5,
                                  align 1<br class="">
                                  >>> %7 = zext i8 %6 to i16<br
                                    class="">
                                  >>> %8 = shl nuw nsw i16 %7,
                                  0<br class="">
                                  >>> %9 = or i16 %8, %4<br
                                    class="">
                                  >>> <br class="">
                                  >>> I came across load
                                  combine pass which is motivated by
                                  virtualliy the same pattern. Load
                                  combine optimizes the pattern by
                                  combining adjacent loads into one load
                                  and lets the rest of the optimizer
                                  cleanup the rest. From what I see on
                                  the initial review for load combine (<a
href="https://reviews.llvm.org/D3580" rel="noreferrer" target="_blank"
                                    class="" moz-do-not-send="true">https://reviews.llvm.org/D3580</a>)
                                  it was not enabled by default because
                                  it caused some performance
                                  regressions. It's not very surprising,
                                  I see how this type of widening can
                                  obscure some facts for the rest of the
                                  optimizer.<br class="">
                                  >>> <br class="">
                                  >>> I can't find any
                                  backstory for this pass, why was it
                                  chosen to optimize the pattern in
                                  question in this way? What is the
                                  current status of this pass?<br
                                    class="">
                                  >>> <br class="">
                                  >>> I have an alternative
                                  implementation for it locally. I
                                  implemented an instcombine rule
                                  similar to recognise bswap/bitreverse
                                  idiom. It relies on collectBitParts
                                  (Local.cpp) to determine the origin of
                                  the bits in a given or value. If all
                                  the bits are happen to be loaded from
                                  adjacent locations it replaces the or
                                  with a single load or a load plus
                                  bswap.<br class="">
                                  >>> <br class="">
                                  >>> If the alternative
                                  approach sounds reasonable I'll post
                                  my patches for review.<br class="">
                                  >>> <br class="">
                                  >>> Artur<br class="">
                                  >> <br class="">
                                  > <br class="">
                                  <br class="">
_______________________________________________<br class="">
                                  LLVM Developers mailing list<br
                                    class="">
                                  <a
                                    href="mailto:llvm-dev@lists.llvm.org"
                                    target="_blank" class=""
                                    moz-do-not-send="true">llvm-dev@lists.llvm.org</a><br
                                    class="">
                                  <a
                                    href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev"
                                    rel="noreferrer" target="_blank"
                                    class="" moz-do-not-send="true">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br
                                    class="">
                                </blockquote>
                              </div>
                            </div>
                          </blockquote>
                        </div>
                        <br class="">
                      </div>
                    </div>
                  </blockquote>
                </div>
              </div>
            </blockquote>
          </div>
          <br class="">
        </div>
      </div>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <pre class="moz-quote-pre" wrap="">_______________________________________________
LLVM Developers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>
<a class="moz-txt-link-freetext" href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>
</pre>
    </blockquote>
  </body>
</html>