<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Nicolas - have you investigated just using inline asm instead?<br>

    </p>

    <div class="moz-cite-prefix">On 11/11/2021 08:34, Wang, Pengfei via

      llvm-dev wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:PH0PR11MB5627C7B528EF0BFA2AED73E188949@PH0PR11MB5627.namprd11.prod.outlook.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <meta name="Generator" content="Microsoft Word 15 (filtered

        medium)">

      <style>@font-face

        {font-family:SimSun;

        panose-1:2 1 6 0 3 1 1 1 1 1;}@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face

        {font-family:"\@SimSun";

        panose-1:2 1 6 0 3 1 1 1 1 1;}p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}span.EmailStyle20

        {mso-style-type:personal-compose;

        font-family:"Calibri",sans-serif;

        color:windowtext;

        font-weight:normal;

        font-style:normal;

        text-decoration:none none;}.MsoChpDefault

        {mso-style-type:export-only;

        font-family:"Calibri",sans-serif;}div.WordSection1

        {page:WordSection1;}</style>

      <div class="WordSection1">

        <p class="MsoNormal">>As I am very new to this part of LLVM,
          I am not sure what is feasible. Would it be
          conceivable to either:<o:p></o:p></p>

        <p class="MsoNormal">>1. have a way to inject some numeric

          cost to influence the value of some resulting combinations?<o:p></o:p></p>

        <p class="MsoNormal">>2. revive some form of intrinsic and

          guarantee that the instruction would be generated?<o:p></o:p></p>

        <p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span style="color:#1F497D">I think a
            feasible approach is to add a new tuningXXX feature for the
            relevant targets and have the combine behave differently
            when that flag is set.<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color:#1F497D">1) seems
            over-engineered, and 2) seems like overkill that would give up
            potential opportunities in the combine.<o:p></o:p></span></p>
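        <p class="MsoNormal">A toy Python model of the tuning-flag idea,
          purely for illustration. The flag name <code>prefer_blend</code>
          and the three-instruction pattern matched here are hypothetical;
          they do not correspond to an existing LLVM tuning feature or to
          the real X86 shuffle combiner.</p>
<pre>
```python
def combine_shuffles(ops, prefer_blend=False):
    """Toy peephole over a list of opcode names.

    Folds SHUFPS, BLENDPS, BLENDPS into UNPCKLPD, UNPCKHPD unless the
    hypothetical tuning flag asks to keep the blends.
    """
    if prefer_blend:
        # Target prefers blends: leave the sequence untouched.
        return list(ops)
    out = []
    i = 0
    while i != len(ops):
        if ops[i:i + 3] == ["SHUFPS", "BLENDPS", "BLENDPS"]:
            out += ["UNPCKLPD", "UNPCKHPD"]
            i += 3
        else:
            out.append(ops[i])
            i += 1
    return out


seq = ["UNPCKLPS", "SHUFPS", "BLENDPS", "BLENDPS"]
folded = combine_shuffles(seq)
kept = combine_shuffles(seq, prefer_blend=True)
```
</pre>
        <p class="MsoNormal">With the flag off the blends are folded away;
          with it on the sequence is returned unchanged, which is the
          behaviour a per-target flag would need to provide.</p>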

        <p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span style="color:#1F497D">Thanks<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color:#1F497D">Phoebe<o:p></o:p></span></p>

        <p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>

        <div style="border:none;border-top:solid #E1E1E1

          1.0pt;padding:3.0pt 0in 0in 0in">

          <p class="MsoNormal"><b>From:</b> llvm-dev

            <a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev-bounces@lists.llvm.org">&lt;llvm-dev-bounces@lists.llvm.org&gt;</a> <b>On Behalf Of
            </b>Nicolas Vasilache via llvm-dev<br>
            <b>Sent:</b> Wednesday, November 10, 2021 5:46 PM<br>
            <b>To:</b> Diego Caballero <a class="moz-txt-link-rfc2396E" href="mailto:diegocaballero@google.com">&lt;diegocaballero@google.com&gt;</a><br>

            <b>Cc:</b> <a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

            <b>Subject:</b> Re: [llvm-dev] Understanding and controlling

            some of the AVX shuffle emission paths<o:p></o:p></p>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <p class="MsoNormal"><o:p> </o:p></p>

          <div>

            <div>

              <p class="MsoNormal">On Wed, Nov 10, 2021 at 10:30 AM

                Diego Caballero <<a

                  href="mailto:diegocaballero@google.com"

                  moz-do-not-send="true" class="moz-txt-link-freetext">diegocaballero@google.com</a>>

                wrote:<o:p></o:p></p>

            </div>

            <blockquote style="border:none;border-left:solid #CCCCCC

              1.0pt;padding:0in 0in 0in

              6.0pt;margin-left:4.8pt;margin-right:0in">

              <div>

                <p class="MsoNormal"><a href="mailto:ntv@google.com"

                    target="_blank" moz-do-not-send="true">+Nicolas

                    Vasilache</a> :)<o:p></o:p></p>

              </div>

            </blockquote>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Thanks Diego. Email is hard; I could
                not find a way to inject myself into my own discussion...<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"> <o:p></o:p></p>

            </div>

            <blockquote style="border:none;border-left:solid #CCCCCC

              1.0pt;padding:0in 0in 0in

              6.0pt;margin-left:4.8pt;margin-right:0in">

              <p class="MsoNormal"><o:p> </o:p></p>

              <div>

                <div>

                  <p class="MsoNormal">On Tue, Nov 9, 2021 at 10:32 PM

                    Simon Pilgrim via llvm-dev <<a

                      href="mailto:llvm-dev@lists.llvm.org"

                      target="_blank" moz-do-not-send="true"

                      class="moz-txt-link-freetext">llvm-dev@lists.llvm.org</a>>

                    wrote:<o:p></o:p></p>

                </div>

                <blockquote style="border:none;border-left:solid #CCCCCC

                  1.0pt;padding:0in 0in 0in

                  6.0pt;margin-left:4.8pt;margin-right:0in">

                  <p class="MsoNormal">On 09/11/2021 20:44, Simon

                    Pilgrim wrote:<br>

                    <br>

                    > On 09/11/2021 08:57, Nicolas Vasilache via

                    llvm-dev wrote:<br>

                    >> Hi everyone,<br>

                    >><br>

                    >> I am experimenting with LLVM lowering,

                    intrinsics and shufflevector <br>

                    >> in general.<br>

                    >><br>

                    >> Here is an IR that I produce with the

                    objective of emitting some <br>

                    >> vblendps instructions: <br>

                    >> <a

href="https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a"

                      target="_blank" moz-do-not-send="true"

                      class="moz-txt-link-freetext">

https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a</a>.

                    <br>

                    >><br>

                    > From what I can see, the original IR code was

                    (effectively):<br>

                    ><br>

                    > 8 x UNPCKLPS/UNPCKHPS<br>

                    > 4 x SHUFPS<br>

                    > 8 x BLENDPS<br>

                    > 4 x INSERTF128<br>

                    > 4 x PERM2F128<br>

                    ><br>

                    >> I compile this further with<br>

                    >><br>

                    >> clang -x ir -emit-llvm -S -mcpu=haswell -O3

                    -o - | llc -O3 <br>

                    >> -mcpu=haswell - -o -<br>

                    >><br>

                    >> to obtain:<br>

                    >><br>

                    >> <a

href="https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a"

                      target="_blank" moz-do-not-send="true"

                      class="moz-txt-link-freetext">

https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a</a>

                    <br>

                    >><br>

                    ><br>

                    > and after the x86 shuffle combines:<br>

                    ><br>

                    > 8 x UNPCKLPS/UNPCKHPS<br>

                    > 8 x UNPCKLPD/UNPCKHPD<br>

                    > 4 x INSERTF128<br>

                    > 4 x PERM2F128<br>

                    ><br>

                    > Starting from each BLENDPS, they've combined

                    with the SHUFPS to create <br>

                    > the UNPCK*PD nodes. We nearly always benefit

                    from folding shuffle <br>

                    > chains to reduce total instruction counts, even

                    if some inner nodes <br>

                    > have multiple uses (like the SHUFPS), and I'd

                    hate to lose that.<br>

                    ><br>

                    >> At this point, I would expect to see some

                    vblendps instructions <br>

                    >> generated for the pieces of IR that produce

                    %48/%49 %51/%52 %54/%55 <br>

                    >> and %57/%58 to reduce pressure on port 5

                    (vblendps can also go on <br>

                    >> ports 0 and 1). However the expected

                    instruction does not get <br>

                    >> generated and llvm-mca continues to show me

                    high port 5 contention.<br>

                    >><br>

                    >> Could people suggest some steps / commands

                    to help better understand <br>

                    >> why my expectation is not met and whether I

                    can do something to make <br>

                    >> the compiler generate what I want? Thanks

                    in advance!<br>

                    > So on Haswell, we've gained 4 extra Port5-only

                    shuffles but removed <br>

                    > the 8 Port015 blends.<br>

                    ><br>

                    > We have very little arch-specific shuffle

                    combines, just the <br>

                    > fast-variable-shuffle tuning flags to avoid

                    unnecessary shuffle mask <br>

                    > loads; the shuffle combines just aim for the

                    reduction in simple <br>

                    > target shuffle nodes. And tbh I'm reluctant to

                    add to this as shuffle <br>

                    > combining is complex already.<br>

                    ><br>

                    > We should be preferring to lower/combine to

                    BLENDPS in more <br>

                    > circumstances (it's commutable and never slower

                    than any other target <br>

                    > shuffle, although demanded elts can do less

                    with 'undef' elements), <br>

                    > but that won't help us here.<br>

                    ><br>

                    > So far I've failed to find a BLEND-based 8x8

                    transpose pattern that <br>

                    > the shuffle combiner doesn't manage to combine

                    back to the <br>

                    > 8xUNPCK/SHUFPS ops :(<o:p></o:p></p>

                </blockquote>

              </div>

            </blockquote>
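            <p class="MsoNormal">To make the fold described above concrete,
              here is a minimal Python sketch of a single 128-bit lane. It
              assumes the blend-based step has the common SHUFPS followed by
              two BLENDPS shape; the immediates 0x4E, 0xC and 0x3 and the
              element names are assumptions for illustration, not taken from
              the gist. Each instruction is modelled as a pure function on a
              4-element list.</p>
<pre>
```python
def shufps(a, b, imm):
    # Two-bit selectors: results 0 and 1 pick from a, results 2 and 3 from b.
    return [a[imm % 4], a[(imm // 4) % 4], b[(imm // 16) % 4], b[(imm // 64) % 4]]


def blendps(a, b, mask):
    # Bit i of mask set: take element i from b, otherwise from a.
    return [b[i] if (mask // (2 ** i)) % 2 else a[i] for i in range(4)]


def unpcklpd(a, b):
    # Low 64-bit half of a followed by low 64-bit half of b.
    return [a[0], a[1], b[0], b[1]]


def unpckhpd(a, b):
    # High 64-bit half of a followed by high 64-bit half of b.
    return [a[2], a[3], b[2], b[3]]


# Outputs of the earlier UNPCKLPS stage for rows a, b and c, d (one lane).
t0 = ["a0", "b0", "a1", "b1"]
t2 = ["c0", "d0", "c1", "d1"]

v = shufps(t0, t2, 0x4E)   # [a1, b1, c0, d0]
lo = blendps(t0, v, 0xC)   # keep t0[0:2], take v[2:4]
hi = blendps(t2, v, 0x3)   # take v[0:2], keep t2[2:4]
```
</pre>
            <p class="MsoNormal">Under these assumptions <code>lo</code> and
              <code>hi</code> hold exactly the elements that UNPCKLPD and
              UNPCKHPD of <code>t0</code> and <code>t2</code> would produce,
              so three instructions can be replaced by two, which is the
              rewrite the combiner performs.</p>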

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">If you are referring to this specific

                code, yes same for me.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">If you are thinking about the general

                8x8 transpose problem, I have a version with

                vector<4xf32> loads that ends up using blends; as

                expected, the port 5 pressure reduction helps and both

                llvm-mca and runtime agree that this is 20-30% faster.<o:p></o:p></p>

            </div>
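            <p class="MsoNormal">The port 5 arithmetic behind this can be
              sketched in a few lines of Python using the instruction counts
              quoted earlier in the thread. The thread describes BLENDPS as
              able to issue on ports 0, 1 and 5 on Haswell and the
              UNPCK/SHUFPS shuffles as port 5 only; treating INSERTF128 and
              PERM2F128 as port 5 only as well is an assumption, and since
              they appear in both sequences it does not affect the
              difference.</p>
<pre>
```python
# Instruction counts for the 8x8 f32 transpose, as quoted in the thread.
BEFORE = {"UNPCKxPS": 8, "SHUFPS": 4, "BLENDPS": 8, "INSERTF128": 4, "PERM2F128": 4}
AFTER = {"UNPCKxPS": 8, "UNPCKxPD": 8, "INSERTF128": 4, "PERM2F128": 4}

# Ops that are not restricted to port 5 (per the thread: ports 0, 1 and 5).
FLEXIBLE = {"BLENDPS"}


def summarize(counts):
    """Return (total instructions, instructions restricted to port 5)."""
    total = sum(counts.values())
    port5_only = sum(n for op, n in counts.items() if op not in FLEXIBLE)
    return total, port5_only


before_total, before_p5 = summarize(BEFORE)
after_total, after_p5 = summarize(AFTER)
extra_p5 = after_p5 - before_p5
```
</pre>
            <p class="MsoNormal">The combined sequence is four instructions
              shorter but puts four more shuffles on port 5, matching the
              "gained 4 extra Port5-only shuffles but removed the 8 Port015
              blends" summary above.</p>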

            <div>

              <p class="MsoNormal"> <o:p></o:p></p>

            </div>

            <blockquote style="border:none;border-left:solid #CCCCCC

              1.0pt;padding:0in 0in 0in

              6.0pt;margin-left:4.8pt;margin-right:0in">

              <div>

                <blockquote style="border:none;border-left:solid #CCCCCC

                  1.0pt;padding:0in 0in 0in

                  6.0pt;margin-left:4.8pt;margin-right:0in">

                  <p class="MsoNormal"><br>

                    The only thing I can think of is you might want to

                    see if you can <br>

                    reorder the INSERTF128/PERM2F128 shuffles in between

                    the UNPCK*PS and <br>

                    the SHUFPS/BLENDPS:<br>

                    <br>

                    8 x UNPCKLPS/UNPCKHPS<br>

                    4 x INSERTF128<br>

                    4 x PERM2F128<br>

                    4 x SHUFPS<br>

                    8 x BLENDPS<br>

                    <br>

                    Splitting the per-lane shuffles with the

                    subvector-shuffles could help <br>

                    stop the shuffle combiner.<o:p></o:p></p>

                </blockquote>

              </div>

            </blockquote>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Right, I tried different variations
                here but invariably got the same result.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">The vector<4xf32> based version

                is something that I also want to target for a bunch of

                orthogonal reasons.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">I'll note that my use case is MLIR
                codegen with explicit vectors and intrinsics -> LLVM,
                so I have quite a bit of flexibility.<o:p></o:p></p>

            </div>

            <p class="MsoNormal">But it feels unnatural in the compiler
              flow to have to branch off at a significantly higher level
              of abstraction to sidestep concerns related to X86
              microarchitecture details.<o:p></o:p></p>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">As I am very new to this part of
                LLVM, I am not sure what is feasible. Would it be
                conceivable to either:<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">1. have a way to inject some numeric

                cost to influence the value of some resulting

                combinations?<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">2. revive some form of intrinsic and

                guarantee that the instruction would be generated?<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">I realize point 2 is contrary to the
                evolution of LLVM, as these intrinsics were removed ca.
                2015 in favor of the combiner-based approach.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Still, it seems that `we have very
                little arch-specific shuffle combines` could be a
                signal that such intrinsics are needed?<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"> <o:p></o:p></p>

            </div>

            <blockquote style="border:none;border-left:solid #CCCCCC

              1.0pt;padding:0in 0in 0in

              6.0pt;margin-left:4.8pt;margin-right:0in">

              <div>

                <blockquote style="border:none;border-left:solid #CCCCCC

                  1.0pt;padding:0in 0in 0in

                  6.0pt;margin-left:4.8pt;margin-right:0in">

                  <p class="MsoNormal"><br>

                    >> I have verified independently that in

                    isolation, a single such <br>

                    >> shuffle creates a vblendps. I see them

                    being recombined in the <br>

                    >> produced assembly, and I would like to
                    experiment with preventing <br>
                    >> vshufps + vblendps + vblendps from being
                    recombined into vunpckxxx + <br>
                    >> vunpckxxx instructions.<br>

                    >><br>

                    >> -- <br>

                    _______________________________________________<br>

                    LLVM Developers mailing list<br>

                    <a href="mailto:llvm-dev@lists.llvm.org"

                      target="_blank" moz-do-not-send="true"

                      class="moz-txt-link-freetext">llvm-dev@lists.llvm.org</a><br>

                    <a

                      href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev"

                      target="_blank" moz-do-not-send="true"

                      class="moz-txt-link-freetext">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></p>

                </blockquote>

              </div>

            </blockquote>

          </div>

          <p class="MsoNormal"><br clear="all">

            <o:p></o:p></p>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <p class="MsoNormal">-- <o:p></o:p></p>

          <div>

            <div>

              <p class="MsoNormal">N<o:p></o:p></p>

            </div>

          </div>

        </div>

      </div>

      <br>


    </blockquote>

  </body>

</html>