<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 03/09/2017 05:42 PM, Michael

      Kuperstein wrote:<br>

    </div>

    <blockquote

cite="mid:CAL_y90myzA8aDKztq9XTPybaQLaat+gK1_i-fdh0JaA1kOV8AQ@mail.gmail.com"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      <div dir="ltr">I think it'd be nice to keep CGP as "a grab-bag of

        relatively simple transformations that don't require/preserve

        any complex analyses". 

        <div><br>

        </div>

        <div>Note that that's not quite where we're at right now - while

          CGP doesn't seem to use AA at all, it does have *one*

          transformation that uses LI and DT -

          eliminateMostlyEmptyBlocks. And it constructs those on demand.

          :-\</div>

      </div>

    </blockquote>

    <br>

    :(<br>

    <br>

    <blockquote

cite="mid:CAL_y90myzA8aDKztq9XTPybaQLaat+gK1_i-fdh0JaA1kOV8AQ@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>We may want to split that out as well.</div>

        <div><br>

        </div>

        <div>(The "ModifiedDT" flag is a bit of red herring - IIUC, what

          it really means is ModifiedCFG, I think the only place that

          uses it is the optimizeBlock() iteration - and that probably

          ought to be using a worklist instead...)<br>

        </div>

      </div>

    </blockquote>

    <br>

    I certainly don't mind breaking things into separate passes so long

    as there aren't phase ordering problems. I'll point out, however,

    that:<br>

     - If they're relatively-simple transformations, then updating the

    DT is probably not that hard (obviously I could have missed

    something, but I skimmed the code that this seems to be the case)<br>

     - There are certainly things that are simple to do, but only if you

    have some proper analysis results. People tend to do hacky things to

    work around not having analysis results and that's not something we

    want to encourage.<br>

    <br>

    Thus, I'd still vote for fixing CGP to preserve DT.<br>

    <br>

     -Hal<br>

    <br>

    <blockquote

cite="mid:CAL_y90myzA8aDKztq9XTPybaQLaat+gK1_i-fdh0JaA1kOV8AQ@mail.gmail.com"

      type="cite">

      <div class="gmail_extra"><br>

        <div class="gmail_quote">On Thu, Mar 9, 2017 at 1:49 PM, Wei Mi

          <span dir="ltr"><<a moz-do-not-send="true"

              href="mailto:wmi@google.com" target="_blank">wmi@google.com</a>></span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div class="HOEnZb">

              <div class="h5">On Thu, Mar 9, 2017 at 10:54 AM, Hal

                Finkel <<a moz-do-not-send="true"

                  href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>>

                wrote:<br>

                > On 03/09/2017 12:14 PM, Wei Mi via llvm-dev wrote:<br>

                >><br>

                >> In<br>

                >> <a moz-do-not-send="true"

href="http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html"

                  rel="noreferrer" target="_blank">http://lists.llvm.org/<wbr>pipermail/cfe-commits/Week-of-<wbr>Mon-20120827/063200.html</a>,<br>

                >> consecutive bitfields are wrapped as a group

                and represented as a<br>

                >> large integer and emits loads stores and bit

                operations appropriate<br>

                >> for extracting bits from within it. It fixes

                the problem of violating<br>

                >> C++11 memory model that original widen

                load/store of bitfield was<br>

                >> facing. It also brings more coalescing

                opportunities across bitfields.<br>

                >><br>

                >> If some bitfields are natural aligned and their

                num of bits can form a<br>

                >> legal integer type that the target supports, it

                is more efficient to<br>

                >> access them directly than doing a large integer

                load/store plus a<br>

                >> series of bit operations. We call such reverse

                transformation legal<br>

                >> type bitfield shrinking. Currently, llvm

                depends on DAGCombiner to do<br>

                >> such shrinking.<br>

                >><br>

                >> However, DAGCombiner has the

                one-basic-block-a-time limitation, so we<br>

                >> started to implement a new shrinking

                optimization in<br>

                >> <a moz-do-not-send="true"

                  href="https://reviews.llvm.org/D30416"

                  rel="noreferrer" target="_blank">https://reviews.llvm.org/<wbr>D30416</a>,

                and initially put it in instcombine,<br>

                >> then moved it to CGP because we want to use

                some TargetLowering<br>

                >> information.<br>

                >><br>

                >> The initial implementation in D30416 only

                supports load-and-or-store<br>

                >> pattern matching, and it uses a inst-by-inst

                scan as a safety check to<br>

                >> make sure there is no other memory write insn

                between the load and<br>

                >> store (no alias query is done).  After getting

                the initial<br>

                >> implementation, we found more problems:

                EarlyCSE, LoadPRE and even<br>

                >> InstCombine itself can do coalescing before the

                shrinking (LoadPRE<br>

                >> does it the most thoroughly).<br>

                >> The coalescing can move the load many

                BasicBlocks earlier and make<br>

                >> simple inst-by-inst scan unscalable and

                oftentimes fail. It also<br>

                >> breaks the load-and-or-store pattern. An

                example is below:<br>

                >><br>

                >> Before coalescing done by earlycse/loadpre:<br>

                >> %bf.load = load i64, i64* %0, align 8<br>

                >> %bf.clear = and i64 %bf.load, -65536<br>

                >> %bf.set = or i64 %bf.value, %bf.clear<br>

                >> store i64 %bf.set2, i64* %9, align 8<br>

                >> .....<br>

                >> %bf.load1 = load i64, i64* %0, align 8<br>

                >> %bf.clear1 = and i64 %bf.load1, -4294901761<br>

                >> %bf.set1 = or i64 %bf.value1, %bf.clear1<br>

                >> store i64 %bf.set2, i64* %9, align 8<br>

                >> .....<br>

                >> %bf.load2 = load i64, i64* %0, align 8<br>

                >> %bf.clear2 = and i64 %bf.load2, -4294901761<br>

                >> %bf.set2 = or i64 %bf.value2, %bf.clear2<br>

                >> store i64 %bf.set2, i64* %9, align 8<br>

                >><br>

                >> After coalescing, it will become:<br>

                >> %bf.load = load i64, i64* %0, align 8<br>

                >> %bf.clear = and i64 %bf.load, -65536<br>

                >> %bf.set = or i64 %bf.value, %bf.clear<br>

                >> .....<br>

                >> %bf.clear1 = and i64 %bf.set, -4294901761<br>

                >> %bf.set1 = or i64 %bf.value1, %bf.clear1<br>

                >> .....<br>

                >> %bf.clear2 = and i64 %bf.set1, -4294901761<br>

                >> %bf.set2 = or i64 %bf.value2, %bf.clear2<br>

                >> store i64 %bf.set2, i64* %9, align 8<br>

                >><br>

                >> After load-store coalescing, %bf.load now is

                far away from the store,<br>

                >> and the previous load-and-or-store pattern

                disappears.<br>

                >><br>

                >> A simple idea to fix it is to move the

                shrinking in a very early pass<br>

                >> before the first pass of EarlyCSE. However, if

                we move shrinking<br>

                >> earlier, it is possible to block the coalescing

                of other ilegal type<br>

                >> bitfields which can not be shrinked. So for

                coalescing and shrinking,<br>

                >> no matter which one is first, it will block the

                other one.<br>

                >><br>

                >> After some discussions with Eli and Michael, I

                come up with another<br>

                >> idea to let shrinking stay in the late

                pipeline. It needs two changes<br>

                >> to the current patch:<br>

                >><br>

                >> 1. extending the pattern match to handle

                store(or(and(or(and...))<br>

                >> pattern above. It needs to analyze the bit mask

                of every and-or pairs.<br>

                >> If the last and-or pair touch different section

                with the other and-or<br>

                >> pairs, we can split the original store into

                two, and do the shrinking<br>

                >> for the second store, like below<br>

                >><br>

                >> %bf.load = load i64, i64* %0, align 8<br>

                >> %bf.clear = and i64 %bf.load, -65536<br>

                >> %bf.set = or i64 %bf.value, %bf.clear<br>

                >> .....<br>

                >><br>

                >> %bf.clear1 = and i64 %bf.set, -4294901761<br>

                >> %bf.set1 = or i64 %bf.value1, %bf.clear1<br>

                >> store i64 %bf.set1, i64* %0, align 8           

                             // the first<br>

                >> store.<br>

                >> .....<br>

                >><br>

                >> %bf.value2.shrinked = shrink_op %bf.value2<br>

                >> store i16 %bf.value2.shrinked, i64* %0, align

                8       // shrinked store.<br>

                >><br>

                >> 2. use memoryssa + alias query to do the safety

                check. Because<br>

                >> dominator tree is not properly updated in CGP,

                I have to create a new<br>

                >> pass and put it before CGP, in order to use

                memoryssa.<br>

                ><br>

                ><br>

                > This makes sense to me. Should we just fix CGP to

                update the DT instead of<br>

                > working around it?<br>

                ><br>

                >  -Hal<br>

                <br>

              </div>

            </div>

            I am not familiar enough with CGP to tell. I just notice

            ModifiedDT is<br>

            modified at several places in CGP, which indicates there are

            a few<br>

            transformations needing their specific dominator tree

            maintainance<br>

            work. And I remember simplifyCFG also doesn't use dominator

            tree to<br>

            save the effort to maintain it on the fly. So maybe it is

            easier to<br>

            separate CGP into two parts: which does need dominator tree

            and<br>

            doesn't. Those transformations which don't need dominator

            tree can<br>

            stay together.<br>

            <br>

            Currently, I know consthoisting is another potential CGP<br>

            transformation but left out because of its need of dominator

            tree. But<br>

            consthoisting has already evolved to contain a substantial

            amount of<br>

            code so may be better to stay as a separate pass.<br>

            <br>

            Thanks,<br>

            Wei.<br>

            <div class="HOEnZb">

              <div class="h5"><br>

                ><br>

                >><br>

                >> Eli suggested me to ask for more opinions

                before start writting code.<br>

                >> I think it is a good idea and here is the post.

                Comments are<br>

                >> appreciated.<br>

                >><br>

                >> Thanks,<br>

                >> Wei.<br>

                >> ______________________________<wbr>_________________<br>

                >> LLVM Developers mailing list<br>

                >> <a moz-do-not-send="true"

                  href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

                >> <a moz-do-not-send="true"

                  href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev"

                  rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>

                ><br>

                ><br>

                > --<br>

                > Hal Finkel<br>

                > Lead, Compiler Technology and Programming Languages<br>

                > Leadership Computing Facility<br>

                > Argonne National Laboratory<br>

                ><br>

              </div>

            </div>

          </blockquote>

        </div>

        <br>

      </div>

    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </body>

</html>