<div dir="ltr">I think we can look at this another way.<div><br></div><div>For modern CPU architectures that support store forwarding with store queues, it is generally not "safe" to blindly do local optimizations that widen loads/stores without sophisticated inter-procedural analysis. Doing so runs the risk of greatly reducing the performance of some programs. Keeping a naturally aligned load/store at its natural type is safer.</div><div><br></div><div>Does that make sense?</div><div><br></div><div>David</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000"><span class="">
    <p><br>
    </p>
    <div class="m_4449805550791560709moz-cite-prefix">On 09/03/2017 10:38 PM, Xinliang David
      Li wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">Store-forwarding stall cost is usually much higher
        than that of a load hitting the L1 cache. For instance, on Haswell,
        the latter is ~4 cycles, while a store-forwarding stall costs
        about 10 cycles more than a successful store forward, which
        is roughly 15 cycles in total. In some scenarios, store-forwarding
        stalls can cost as much as 50 cycles. See Agner Fog's documentation.
        <br>
      </div>
    </blockquote>
    <br></span>
    I understand. As I understand it, there are two potential ways to
    fix this problem:<br>
    <br>
     1. You can make the store wider (to match the size of the wide
    load, thus permitting forwarding).<br>
     2. You can make the load smaller (to match the size of the small
    store, thus permitting forwarding).<br>
    <br>
    At least in this benchmark, which is a better solution?<br>
    <br>
    Thanks again,<br>
    Hal<div><div class="h5"><br>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div><br>
        </div>
        <div>In other words, the optimizer needs to be taught to avoid
          defeating this HW pipeline feature as much as possible.</div>
        <div><br>
        </div>
        <div>David</div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Sun, Sep 3, 2017 at 6:32 PM, Hal
          Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div class="m_4449805550791560709HOEnZb">
              <div class="m_4449805550791560709h5"><br>
                On 09/03/2017 03:44 PM, Wei Mi wrote:<br>
                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                  On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>> wrote:<br>
                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                    On 08/22/2017 10:56 PM, Wei Mi via llvm-commits
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li
                      <<a href="mailto:davidxl@google.com" target="_blank">davidxl@google.com</a>><br>
                      wrote:<br>
                      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                        <br>
                        On Tue, Aug 22, 2017 at 6:37 PM, Chandler
                        Carruth via Phabricator<br>
                        <<a href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>>
                        wrote:<br>
                        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                          chandlerc added a comment.<br>
                          <br>
                          I'm really not a fan of the degree of
                          complexity and subtlety that this<br>
                          introduces into the frontend, all to allow
                          particular backend<br>
                          optimizations.<br>
                          <br>
                          I feel like this is Clang working around a
                          fundamental deficiency in<br>
                          LLVM<br>
                          and we should instead find a way to fix this
                          in LLVM itself.<br>
                          <br>
                          As has been pointed out before, user code can
                          synthesize large integers<br>
                          that small bit sequences are extracted from,
                          and Clang and LLVM should<br>
                          handle those just as well as actual bitfields.<br>
                          <br>
                          Can we see how far we can push the LLVM side
                          before we add complexity to<br>
                          Clang here? I understand that there remain
                          challenges to LLVM's stuff,<br>
                          but I<br>
                          don't think those challenges make *all* of the
                          LLVM improvements off the<br>
                          table, I don't think we've exhausted all ways
                          of improving the LLVM<br>
                          changes<br>
                          being proposed, and I think we should still
                          land all of those and<br>
                          re-evaluate how important these issues are
                          when all of that is in place.<br>
                        </blockquote>
                        <br>
                        The main challenge of doing this in LLVM is
                        that inter-procedural<br>
                        analysis<br>
                        (and possibly cross module) is needed (for store
                        forwarding issues).<br>
                        <br>
                        Wei, perhaps you can provide a concrete test case
                        to illustrate the issue<br>
                        so<br>
                        that reviewers have a good understanding.<br>
                        <br>
                        David<br>
                      </blockquote>
                      Here is a runable testcase:<br>
                      -------------------- 1.cc ------------------------<br>
                      class A {<br>
                      public:<br>
                          unsigned long f1:2;<br>
                          unsigned long f2:6;<br>
                          unsigned long f3:8;<br>
                          unsigned long f4:4;<br>
                      };<br>
                      A a;<br>
                      unsigned long b;<br>
                      unsigned long N = 1000000000;<br>
                      <br>
                      __attribute__((noinline))<br>
                      void foo() {<br>
                          a.f3 = 3;<br>
                      }<br>
                      <br>
                      __attribute__((noinline))<br>
                      void goo() {<br>
                          b = a.f3;<br>
                      }<br>
                      <br>
                      int main() {<br>
                          unsigned long i;<br>
                          for (i = 0; i < N; i++) {<br>
                            foo();<br>
                            goo();<br>
                          }<br>
                      }<br>
                      ------------------------------<wbr>------------------------------<br>
                      Now trunk takes about twice the running time of
                      trunk + this<br>
                      patch. That is because trunk shrinks the store of
                      a.f3 in foo (done by<br>
                      DAGCombiner) but does not shrink the load of a.f3 in
                      goo, so store<br>
                      forwarding is blocked.<br>
                    </blockquote>
                    <br>
                    I can confirm that this is true on Haswell and also
                    on a POWER8.<br>
                    Interestingly, on a POWER7, the performance is the
                    same either way (on the<br>
                    good side). I ran the test case as presented and
                    where I replaced f3 with a<br>
                    non-bitfield unsigned char member. Thinking that the
                    POWER7 result might be<br>
                    because it's big-endian, I flipped the order of the<br>
                    fields, and found that<br>
                    the version where f3 is not a bitfield is faster
                    than otherwise, but only by<br>
                    12.5%.<br>
                    <br>
                    Why, in this case, don't we shrink the load? It
                    seems like we should (and it<br>
                    seems like a straightforward case).<br>
                    <br>
                    Thanks again,<br>
                    Hal<br>
                    <br>
                  </blockquote>
                  Hal, thanks for trying the test.<br>
                  <br>
                  Yes, it is straightforward to shrink the load in the
                  test. I can<br>
                  change the testcase a little to show why it is
                  sometimes difficult to<br>
                  shrink the load:<br>
                  <br>
                  class A {<br>
                  public:<br>
                     unsigned long f1:16;<br>
                     unsigned long f2:16;<br>
                     unsigned long f3:16;<br>
                     unsigned long f4:8;<br>
                  };<br>
                  A a;<br>
                  bool b;<br>
                  unsigned long N = 1000000000;<br>
                  <br>
                  __attribute__((noinline))<br>
                  void foo() {<br>
                     a.f4 = 3;<br>
                  }<br>
                  <br>
                  __attribute__((noinline))<br>
                  void goo() {<br>
                     b = (a.f4 == 0 && a.f3 == (unsigned
                  short)-1);<br>
                  }<br>
                  <br>
                  int main() {<br>
                     unsigned long i;<br>
                     for (i = 0; i < N; i++) {<br>
                       foo();<br>
                       goo();<br>
                     }<br>
                  }<br>
                  <br>
                  For the load of a.f4 in goo, it is difficult to motivate
                  shrinking it after<br>
                  instcombine because the comparison with a.f3 and the
                  comparison with<br>
                  a.f4 are merged:<br>
                  <br>
                  define void @_Z3goov() local_unnamed_addr #0 {<br>
                     %1 = load i64, i64* bitcast (%class.A* @a to i64*),
                  align 8<br>
                     %2 = and i64 %1, 0xffffff00000000<br>
                     %3 = icmp eq i64 %2, 0xffff00000000<br>
                     %4 = zext i1 %3 to i8<br>
                     store i8 %4, i8* @b, align 1, !tbaa !2<br>
                     ret void<br>
                  }<br>
                </blockquote>
                <br>
              </div>
            </div>
            Exactly. But this behavior is desirable, locally. There's no
            good answer here: We either generate extra load traffic here
            (because we need to load the fields separately), or we widen
            the store (which generates extra load traffic there). Do you
            know, in terms of performance, which is better in this case
            (i.e., is it better to widen the store or split the load)?<span class="m_4449805550791560709HOEnZb"><font color="#888888"><br>
                <br>
                 -Hal</font></span>
            <div class="m_4449805550791560709HOEnZb">
              <div class="m_4449805550791560709h5"><br>
                <br>
                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                  <br>
                  Thanks,<br>
                  Wei.<br>
                  <br>
                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      The testcase shows the potential problem of store
                      shrinking. Before<br>
                      we decide to do store shrinking, we need to know
                      that all the related loads<br>
                      will be shrunk, and that requires IPA analysis.
                      Otherwise, when load<br>
                      shrinking is blocked for some difficult case
                      (like the instcombine<br>
                      case described in<br>
                      <a href="https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html" rel="noreferrer" target="_blank">https://www.mail-archive.com/c<wbr>fe-commits@lists.llvm.org/msg6<wbr>5085.html</a>),<br>
                      a performance regression will happen.<br>
                      <br>
                      Wei.<br>
                      <br>
                      <br>
                      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                          <br>
                          Repository:<br>
                              rL LLVM<br>
                          <br>
                          <a href="https://reviews.llvm.org/D36562" rel="noreferrer" target="_blank">https://reviews.llvm.org/D3656<wbr>2</a><br>
                          <br>
                          <br>
                          <br>
                        </blockquote>
                      </blockquote>
                      ______________________________<wbr>_________________<br>
                      llvm-commits mailing list<br>
                      <a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a><br>
                      <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-commits</a><br>
                    </blockquote>
                    <br>
                    --<br>
                    Hal Finkel<br>
                    Lead, Compiler Technology and Programming Languages<br>
                    Leadership Computing Facility<br>
                    Argonne National Laboratory<br>
                    <br>
                  </blockquote>
                </blockquote>
                <br>
                -- <br>
                Hal Finkel<br>
                Lead, Compiler Technology and Programming Languages<br>
                Leadership Computing Facility<br>
                Argonne National Laboratory<br>
                <br>
              </div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
    <pre class="m_4449805550791560709moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </div></div></div>

</blockquote></div><br></div>