<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Sep 3, 2017 at 9:23 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000"><span class="">
    <p><br>
    </p>
    <div class="m_-5573162088652840693moz-cite-prefix">On 09/03/2017 11:06 PM, Xinliang David
      Li wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">I think we can think this in another way.
        <div><br>
        </div>
        <div>For modern CPU architectures which support store
          forwarding with store queues, it is generally not "safe" to
          blindly do local optimizations that widen loads/stores</div>
      </div>
    </blockquote>
    <br></span>
    Why not widen stores? Generally the problem with store forwarding is
    where the load is wider than the store (plus alignment issues).<span class=""><br><br></span></div></blockquote><div><br></div><div>True, but probably with some caveats that are target dependent. Store widening also requires additional bit operations (and possibly an additional load), so it is tricky to model the overall benefit.</div><div><br></div>
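<div>To make that concrete, here is a minimal C++ sketch (hand-written for illustration, not compiler output, and assuming the usual x86-64 little-endian layout with a 64-bit unsigned long) of what widening the one-byte bitfield store to the full word entails: an extra load of the containing word plus mask/insert bit operations before the wide store.</div><div><br></div>
<pre>
#include <cstdint>
#include <cstring>

struct A { unsigned long f1:2, f2:6, f3:8, f4:4; };
A a;

// Narrow store, as the DAG combiner shrinks it today: one 1-byte store.
// (f3 occupies bits 8-15, i.e. byte 1 of the word, on x86-64.)
void store_narrow() {
  reinterpret_cast<unsigned char *>(&a)[1] = 3;
}

// Widened store: needs an extra load of the whole word plus mask/or
// operations before the 8-byte store.
void store_wide() {
  std::uint64_t word;
  std::memcpy(&word, &a, sizeof(word));       // additional load
  word = (word & ~0xFF00ull) | (3ull << 8);   // clear f3, insert 3
  std::memcpy(&a, &word, sizeof(word));       // wide store
}
</pre>
<div>Whether those extra operations pay off depends on whether a wider load of the same word follows soon enough to benefit from forwarding, which is the target-dependent part.</div><div><br></div>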
    <blockquote type="cite">
      <div dir="ltr">
        <div> without sophisticated inter-procedural analysis. Doing so
          runs the risk of greatly reducing the performance of some
          programs. Keeping a naturally aligned load/store at its
          natural type is safer. </div>
        <div><br>
        </div>
        <div>Does it make sense?</div>
      </div>
    </blockquote>
    <br></span>
    It makes sense. I could, however, say the same thing about inlining.
    We need to make inlining decisions locally, but they have global
    impact. Nevertheless, we need to make inlining decisions, and
    there's no practical way to make them in a truly non-local way.<br></div></blockquote><div><br></div><div>Speaking of inlining, we are actually thinking of ways to make the decisions more globally optimal, but that is off topic.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    <br>
    We also don't pessimize common cases to improve performance in rare
    cases. In the common case, reducing pressure on the memory units,
    and reducing the critical path, seem likely to me to be optimal. If
    that's not true, or doing otherwise has negligible cost (but can
    prevent rare downsides), we should certainly consider those options.</div></blockquote><div><br></div><div>Since we don't do load widening for non-bitfield cases (only the very limited case of naturally aligned bitfields), it is hard to say we pessimize common cases for rare cases:</div><div><br></div><div>1) the upside of widening such accesses is not high, based on experience with another compiler (which does not do so);</div><div>2) there is an obvious downside of introducing additional extract instructions, which degrades performance (see the sketch below);</div><div>3) there is an obvious downside of severely degraded performance when store forwarding is blocked.</div><div><br></div>
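<div>As an illustration of point 2, here is a hypothetical sketch (not what Clang currently emits; x86-64 layout assumed) contrasting a natural-width bitfield load with a widened one that needs extra extract instructions:</div><div><br></div>
<pre>
#include <cstdint>
#include <cstring>

struct A { unsigned long f1:2, f2:6, f3:8, f4:4; };
A a;

// Natural-width access: a single 1-byte load of the byte holding f3.
unsigned long load_narrow() {
  return reinterpret_cast<const unsigned char *>(&a)[1];
}

// Widened access: an 8-byte load plus a shift and a mask to extract f3 --
// the additional extract instructions mentioned in point 2.
unsigned long load_wide() {
  std::uint64_t word;
  std::memcpy(&word, &a, sizeof(word));  // wide load
  return (word >> 8) & 0xFF;             // extract f3
}
</pre>
<div><br></div>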
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    <br>
    And none of this answers the question of whether it's better to have
    the store wider or the load split and narrow.<br></div></blockquote><div><br></div><div>It seems safer to do store widening more aggressively to avoid the store-forwarding stall issue, but doing this aggressively may also introduce other runtime overhead (an extra load, data merging, etc.), as in the sketch below.</div><div><br></div>
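<div>For instance, with Wei's second testcase (again only a sketch, assuming x86-64 layout), goo() keeps a single 8-byte load because the f3/f4 comparisons were merged, so forwarding only succeeds if foo()'s store covers that load, and widening foo()'s store costs an extra load plus a data merge:</div><div><br></div>
<pre>
#include <cstdint>
#include <cstring>

struct A { unsigned long f1:16, f2:16, f3:16, f4:8; };
A a;

// Narrow store: one 1-byte store to the byte holding f4; the later
// 8-byte load in goo() cannot be forwarded from it (stall).
void foo_narrow() {
  reinterpret_cast<unsigned char *>(&a)[6] = 3;
}

// Widened store: forwarding to the 8-byte load now works, but foo()
// pays for an extra load and a data merge.
void foo_wide() {
  std::uint64_t word;
  std::memcpy(&word, &a, sizeof(word));             // extra load
  word = (word & ~(0xFFull << 48)) | (3ull << 48);  // merge f4 = 3
  std::memcpy(&a, &word, sizeof(word));             // 8-byte store
}
</pre>
<div><br></div>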
<div>Thanks,</div><div><br></div><div>David</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    <br>
    Thanks again,<br>
    Hal<div><div class="h5"><br>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div><br>
        </div>
        <div>David</div>
        <div><br>
        </div>
        <div><br>
        </div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Sun, Sep 3, 2017 at 8:55 PM, Hal
          Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"><span>
                <p><br>
                </p>
                <div class="m_-5573162088652840693m_4449805550791560709moz-cite-prefix">On
                  09/03/2017 10:38 PM, Xinliang David Li wrote:<br>
                </div>
                <blockquote type="cite">
                  <div dir="ltr">Store forwarding stall cost is usually
                    much higher compared with a load hitting L1 cache.
                    For instance, on Haswell,  the latter is ~4 cycles,
                    while the store forwarding stalls cost about 10
                    cycles more than a successful store forwarding,
                    which is roughly 15 cycles. In some scenarios, the
                    store forwarding stalls can be as high as 50 cycles.
                    See Agner's documentation.  <br>
                  </div>
                </blockquote>
                <br>
              </span> I understand. As I see it, there are two
              potential ways to fix this problem:<br>
              <br>
               1. You can make the store wider (to match the size of the
              wide load, thus permitting forwarding).<br>
               2. You can make the load smaller (to match the size of
              the small store, thus permitting forwarding).<br>
              <br>
              At least in this benchmark, which is a better solution?<br>
              <br>
              Thanks again,<br>
              Hal
              <div>
                <div class="m_-5573162088652840693h5"><br>
                  <br>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div><br>
                      </div>
                      <div>In other words, the optimizer needs to be
                        taught to avoid defeating this HW pipeline
                        feature as much as possible.</div>
                      <div><br>
                      </div>
                      <div>David</div>
                    </div>
                    <div class="gmail_extra"><br>
                      <div class="gmail_quote">On Sun, Sep 3, 2017 at
                        6:32 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>
                        wrote:<br>
                        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                          <div class="m_-5573162088652840693m_4449805550791560709HOEnZb">
                            <div class="m_-5573162088652840693m_4449805550791560709h5"><br>
                              On 09/03/2017 03:44 PM, Wei Mi wrote:<br>
                              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> On Sat,
                                Sep 2, 2017 at 6:04 PM, Hal Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>
                                wrote:<br>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> On 08/22/2017
                                  10:56 PM, Wei Mi via llvm-commits
                                  wrote:<br>
                                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> On Tue, Aug
                                    22, 2017 at 7:03 PM, Xinliang David
                                    Li <<a href="mailto:davidxl@google.com" target="_blank">davidxl@google.com</a>><br>
                                    wrote:<br>
                                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br>
                                      On Tue, Aug 22, 2017 at 6:37 PM,
                                      Chandler Carruth via Phabricator<br>
                                      <<a href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>>
                                      wrote:<br>
                                      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                        chandlerc added a comment.<br>
                                        <br>
                                        I'm really not a fan of the
                                        degree of complexity and
                                        subtlety that this<br>
                                        introduces into the frontend,
                                        all to allow particular backend<br>
                                        optimizations.<br>
                                        <br>
                                        I feel like this is Clang
                                        working around a fundamental
                                        deficiency in<br>
                                        LLVM<br>
                                        and we should instead find a way
                                        to fix this in LLVM itself.<br>
                                        <br>
                                        As has been pointed out before,
                                        user code can synthesize large
                                        integers<br>
                                        that small bit sequences are
                                        extracted from, and Clang and
                                        LLVM should<br>
                                        handle those just as well as
                                        actual bitfields.<br>
                                        <br>
                                        Can we see how far we can push
                                        the LLVM side before we add
                                        complexity to<br>
                                        Clang here? I understand that
                                        there remain challenges to
                                        LLVM's stuff,<br>
                                        but I<br>
                                        don't think those challenges
                                        make *all* of the LLVM
                                        improvements off the<br>
                                        table, I don't think we've
                                        exhausted all ways of improving
                                        the LLVM<br>
                                        changes<br>
                                        being proposed, and I think we
                                        should still land all of those
                                        and<br>
                                        re-evaluate how important these
                                        issues are when all of that is
                                        in place.<br>
                                      </blockquote>
                                      <br>
                                      The main challenge of doing this in LLVM is that inter-procedural<br>
                                      (and possibly cross-module) analysis is needed (for store-forwarding issues).<br>
                                      <br>
                                      Wei, perhaps you can provide a concrete test case to illustrate the issue<br>
                                      so that reviewers have a good understanding.<br>
                                      <br>
                                      David<br>
                                    </blockquote>
                                    Here is a runable testcase:<br>
                                    -------------------- 1.cc
                                    ------------------------<br>
                                    class A {<br>
                                    public:<br>
                                        unsigned long f1:2;<br>
                                        unsigned long f2:6;<br>
                                        unsigned long f3:8;<br>
                                        unsigned long f4:4;<br>
                                    };<br>
                                    A a;<br>
                                    unsigned long b;<br>
                                    unsigned long N = 1000000000;<br>
                                    <br>
                                    __attribute__((noinline))<br>
                                    void foo() {<br>
                                        a.f3 = 3;<br>
                                    }<br>
                                    <br>
                                    __attribute__((noinline))<br>
                                    void goo() {<br>
                                        b = a.f3;<br>
                                    }<br>
                                    <br>
                                    int main() {<br>
                                        unsigned long i;<br>
                                        for (i = 0; i < N; i++) {<br>
                                          foo();<br>
                                          goo();<br>
                                        }<br>
                                    }<br>
                                    ------------------------------------------------------------<br>
                                    Now trunk takes about twice the running time compared with trunk + this<br>
                                    patch. That is because trunk shrinks the store of a.f3 in foo (done by<br>
                                    DAGCombiner) but does not shrink the load of a.f3 in goo, so store<br>
                                    forwarding will be blocked.<br>
                                  </blockquote>
                                  <br>
                                  I can confirm that this is true on Haswell and also on a POWER8.<br>
                                  Interestingly, on a POWER7, the performance is the same either way (on the<br>
                                  good side). I ran the test case as presented and also with f3 replaced by a<br>
                                  non-bitfield unsigned char member. Thinking that the POWER7 result might be<br>
                                  because it's big-endian, I flipped the order of the fields, and found that<br>
                                  the version where f3 is not a bitfield is faster than otherwise, but only by<br>
                                  12.5%.<br>
                                  <br>
                                  Why, in this case, don't we shrink the
                                  load? It seems like we should (and it<br>
                                  seems like a straightforward case).<br>
                                  <br>
                                  Thanks again,<br>
                                  Hal<br>
                                  <br>
                                </blockquote>
                                Hal, thanks for trying the test.<br>
                                <br>
                                Yes, it is straightforward to shrink the load in this test. I can<br>
                                change the testcase a little to show why it is sometimes difficult to<br>
                                shrink the load:<br>
                                <br>
                                class A {<br>
                                public:<br>
                                   unsigned long f1:16;<br>
                                   unsigned long f2:16;<br>
                                   unsigned long f3:16;<br>
                                   unsigned long f4:8;<br>
                                };<br>
                                A a;<br>
                                bool b;<br>
                                unsigned long N = 1000000000;<br>
                                <br>
                                __attribute__((noinline))<br>
                                void foo() {<br>
                                   a.f4 = 3;<br>
                                }<br>
                                <br>
                                __attribute__((noinline))<br>
                                void goo() {<br>
                                   b = (a.f4 == 0 && a.f3 ==
                                (unsigned short)-1);<br>
                                }<br>
                                <br>
                                int main() {<br>
                                   unsigned long i;<br>
                                   for (i = 0; i < N; i++) {<br>
                                     foo();<br>
                                     goo();<br>
                                   }<br>
                                }<br>
                                <br>
                                For the load of a.f4 in goo, it is difficult to motivate shrinking it after<br>
                                instcombine, because the comparison with a.f3 and the comparison with<br>
                                a.f4 have been merged:<br>
                                <br>
                                define void @_Z3goov()
                                local_unnamed_addr #0 {<br>
                                   %1 = load i64, i64* bitcast
                                (%class.A* @a to i64*), align 8<br>
                                   %2 = and i64 %1, 0xffffff00000000<br>
                                   %3 = icmp eq i64 %2, 0xffff00000000<br>
                                   %4 = zext i1 %3 to i8<br>
                                   store i8 %4, i8* @b, align 1, !tbaa
                                !2<br>
                                   ret void<br>
                                }<br>
                              </blockquote>
                              <br>
                            </div>
                          </div>
                          Exactly. But this behavior is desirable,
                          locally. There's no good answer here: We
                          either generate extra load traffic here
                          (because we need to load the fields
                          separately), or we widen the store (which
                          generates extra load traffic there). Do you
                          know, in terms of performance, which is better
                          in this case (i.e., is it better to widen the
                          store or split the load)?<span class="m_-5573162088652840693m_4449805550791560709HOEnZb"><font color="#888888"><br>
                              <br>
                               -Hal</font></span>
                          <div class="m_-5573162088652840693m_4449805550791560709HOEnZb">
                            <div class="m_-5573162088652840693m_4449805550791560709h5"><br>
                              <br>
                              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br>
                                Thanks,<br>
                                Wei.<br>
                                <br>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> The
                                    testcases shows the potential
                                    problem of store shrinking. Before<br>
                                    we decide to do store shrinking, we
                                    need to know all the related loads<br>
                                    will be shrunk,  and that requires
                                    IPA analysis. Otherwise, when load<br>
                                    shrinking was blocked for some
                                    difficult case (Like the instcombine<br>
                                    case described in<br>
                                    <a href="https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html" rel="noreferrer" target="_blank">https://www.mail-archive.com/c<wbr>fe-commits@lists.llvm.org/msg6<wbr>5085.html</a>),<br>
                                    performance regression will happen.<br>
                                    <br>
                                    Wei.<br>
                                    <br>
                                    <br>
                                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br>
                                        Repository:<br>
                                            rL LLVM<br>
                                        <br>
                                        <a href="https://reviews.llvm.org/D36562" rel="noreferrer" target="_blank">https://reviews.llvm.org/D3656<wbr>2</a><br>
                                        <br>
                                        <br>
                                        <br>
                                      </blockquote>
                                    </blockquote>
                                  </blockquote>
                                  <br>
                                  --<br>
                                  Hal Finkel<br>
                                  Lead, Compiler Technology and
                                  Programming Languages<br>
                                  Leadership Computing Facility<br>
                                  Argonne National Laboratory<br>
                                  <br>
                                </blockquote>
                              </blockquote>
                              <br>
                              -- <br>
                              Hal Finkel<br>
                              Lead, Compiler Technology and Programming
                              Languages<br>
                              Leadership Computing Facility<br>
                              Argonne National Laboratory<br>
                              <br>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                      <br>
                    </div>
                  </blockquote>
                  <br>
                  <pre class="m_-5573162088652840693m_4449805550791560709moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
    <pre class="m_-5573162088652840693moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </div></div></div>

</blockquote></div><br></div></div>