<div dir="ltr">FYI, a commit is up for the inline asm change: <a href="https://reviews.llvm.org/D114335">https://reviews.llvm.org/D114335</a>.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Nov 14, 2021 at 11:16 PM Nicolas Vasilache <<a href="mailto:ntv@google.com">ntv@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Not yet; the InlineAsmOp in MLIR is still generally unused.<div>It has been used a bit in the IREE project though (<a href="https://github.com/google/iree/blob/49a81c60329437e64791ee1abd09d47fe1cde205/iree/compiler/Codegen/LLVMCPU/VectorContractToAArch64InlineAsmOp.cpp#L103" target="_blank">https://github.com/google/iree/blob/49a81c60329437e64791ee1abd09d47fe1cde205/iree/compiler/Codegen/LLVMCPU/VectorContractToAArch64InlineAsmOp.cpp#L103</a>).</div><div><br></div><div>I should indeed be able to intersperse my lowering (<a href="https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp#L124" target="_blank">https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp#L124</a>) with some InlineAsmOp uses.</div><div>I'll report back when I have something.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Nov 14, 2021 at 4:53 PM Simon Pilgrim <<a href="mailto:llvm-dev@redking.me.uk" target="_blank">llvm-dev@redking.me.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p>Nicolas - have you investigated just using inline asm instead?<br>
    </p>
    <div>On 11/11/2021 08:34, Wang, Pengfei via
      llvm-dev wrote:<br>
    </div>
    <blockquote type="cite">
      
      
      
      <div>
        <p class="MsoNormal">>As I am very new to this part of LLVM,
          I am not sure what is feasible. Would it be
          conceivable to either:<u></u><u></u></p>
        <p class="MsoNormal">>1. have a way to inject some numeric
          cost to influence the value of some resulting combinations?<u></u><u></u></p>
        <p class="MsoNormal">>2. revive some form of intrinsic and
          guarantee that the instruction would be generated?<u></u><u></u></p>
        <p class="MsoNormal"><span style="color:rgb(31,73,125)"><u></u> <u></u></span></p>
        <p class="MsoNormal"><span style="color:rgb(31,73,125)">I think a
            feasible way is to add a new tuningXXX feature for given
            targets and do something different with the flag in the
            combine.<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="color:rgb(31,73,125)">1) seems
            over-engineered, and 2) seems like overkill that would give up
            potential opportunities in the combine.<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="color:rgb(31,73,125)"><u></u> <u></u></span></p>
        <p class="MsoNormal"><span style="color:rgb(31,73,125)">Thanks<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="color:rgb(31,73,125)">Phoebe<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="color:rgb(31,73,125)"><u></u> <u></u></span></p>
        <div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(225,225,225);padding:3pt 0in 0in">
          <p class="MsoNormal"><b>From:</b> llvm-dev
            <a href="mailto:llvm-dev-bounces@lists.llvm.org" target="_blank"><llvm-dev-bounces@lists.llvm.org></a> <b>On Behalf Of
            </b>Nicolas Vasilache via llvm-dev<br>
            <b>Sent:</b> Wednesday, November 10, 2021 5:46 PM<br>
            <b>To:</b> Diego Caballero <a href="mailto:diegocaballero@google.com" target="_blank"><diegocaballero@google.com></a><br>
            <b>Cc:</b> <a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
            <b>Subject:</b> Re: [llvm-dev] Understanding and controlling
            some of the AVX shuffle emission paths<u></u><u></u></p>
        </div>
        <p class="MsoNormal"><u></u> <u></u></p>
        <div>
          <div>
            <p class="MsoNormal"><u></u> <u></u></p>
          </div>
          <p class="MsoNormal"><u></u> <u></u></p>
          <div>
            <div>
              <p class="MsoNormal">On Wed, Nov 10, 2021 at 10:30 AM
                Diego Caballero <<a href="mailto:diegocaballero@google.com" target="_blank">diegocaballero@google.com</a>>
                wrote:<u></u><u></u></p>
            </div>
            <blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
              <div>
                <p class="MsoNormal"><a href="mailto:ntv@google.com" target="_blank">+Nicolas
                    Vasilache</a> :)<u></u><u></u></p>
              </div>
            </blockquote>
            <div>
              <p class="MsoNormal"><u></u> <u></u></p>
            </div>
            <div>
              <p class="MsoNormal">Thanks Diego, email is hard, I could
                not find ways to inject myself into my own discussion...<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal"> <u></u><u></u></p>
            </div>
            <blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
              <p class="MsoNormal"><u></u> <u></u></p>
              <div>
                <div>
                  <p class="MsoNormal">On Tue, Nov 9, 2021 at 10:32 PM
                    Simon Pilgrim via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>>
                    wrote:<u></u><u></u></p>
                </div>
                <blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
                  <p class="MsoNormal">On 09/11/2021 20:44, Simon
                    Pilgrim wrote:<br>
                    <br>
                    > On 09/11/2021 08:57, Nicolas Vasilache via
                    llvm-dev wrote:<br>
                    >> Hi everyone,<br>
                    >><br>
                    >> I am experimenting with LLVM lowering,
                    intrinsics and shufflevector <br>
                    >> in general.<br>
                    >><br>
                    >> Here is an IR that I produce with the
                    objective of emitting some <br>
                    >> vblendps instructions: <br>
                    >> <a href="https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a" target="_blank">
https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a</a>.
                    <br>
                    >><br>
                    > From what I can see, the original IR code was
                    (effectively):<br>
                    ><br>
                    > 8 x UNPCKLPS/UNPCKHPS<br>
                    > 4 x SHUFPS<br>
                    > 8 x BLENDPS<br>
                    > 4 x INSERTF128<br>
                    > 4 x PERM2F128<br>
                    ><br>
                    >> I compile this further with<br>
                    >><br>
                    >> clang -x ir -emit-llvm -S -mcpu=haswell -O3
                    -o - | llc -O3 <br>
                    >> -mcpu=haswell - -o -<br>
                    >><br>
                    >> to obtain:<br>
                    >><br>
                    >> <a href="https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a" target="_blank">
https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a</a>
                    <br>
                    >><br>
                    ><br>
                    > and after the x86 shuffle combines:<br>
                    ><br>
                    > 8 x UNPCKLPS/UNPCKHPS<br>
                    > 8 x UNPCKLPD/UNPCKHPD<br>
                    > 4 x INSERTF128<br>
                    > 4 x PERM2F128<br>
                    ><br>
                    > Starting from each BLENDPS, they've combined
                    with the SHUFPS to create <br>
                    > the UNPCK*PD nodes. We nearly always benefit
                    from folding shuffle <br>
                    > chains to reduce total instruction counts, even
                    if some inner nodes <br>
                    > have multiple uses (like the SHUFPS), and I'd
                    hate to lose that.<br>
                    ><br>
                    >> At this point, I would expect to see some
                    vblendps instructions <br>
                    >> generated for the pieces of IR that produce
                    %48/%49 %51/%52 %54/%55 <br>
                    >> and %57/%58 to reduce pressure on port 5
                    (vblendps can also go on <br>
                    >> ports 0 and 1). However the expected
                    instruction does not get <br>
                    >> generated and llvm-mca continues to show me
                    high port 5 contention.<br>
                    >><br>
                    >> Could people suggest some steps / commands
                    to help better understand <br>
                    >> why my expectation is not met and whether I
                    can do something to make <br>
                    >> the compiler generate what I want? Thanks
                    in advance!<br>
                    > So on Haswell, we've gained 4 extra Port5-only
                    shuffles but removed <br>
                    > the 8 Port015 blends.<br>
                    ><br>
                    > We have very little arch-specific shuffle
                    combines, just the <br>
                    > fast-variable-shuffle tuning flags to avoid
                    unnecessary shuffle mask <br>
                    > loads; the shuffle combiner just aims to
                    reduce the number of simple <br>
                    > target shuffle nodes. And tbh I'm reluctant to
                    add to this as shuffle <br>
                    > combining is complex already.<br>
                    ><br>
                    > We should be preferring to lower/combine to
                    BLENDPS in more <br>
                    > circumstances (it's commutable and never slower
                    than any other target <br>
                    > shuffle, although demanded elts can do less
                    with 'undef' elements), <br>
                    > but that won't help us here.<br>
                    ><br>
                    > So far I've failed to find a BLEND-based 8x8
                    transpose pattern that <br>
                    > the shuffle combiner doesn't manage to combine
                    back to the <br>
                    > 8xUNPCK/SHUFPS ops :(<u></u><u></u></p>
                </blockquote>
              </div>
            </blockquote>
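            <div>
              <p class="MsoNormal">As a rough illustration of why the combiner folds these nodes, here is a minimal sketch that models a single 128-bit lane in Python. The immediates 0x4E, 0xCC and 0x33 are assumptions taken from the usual blend-based transpose formulation, not values read out of the gist, and the helper names are hypothetical. It shows that one SHUFPS feeding two BLENDPS computes exactly what UNPCKLPD/UNPCKHPD compute, which matches the 4 x SHUFPS + 8 x BLENDPS to 8 x UNPCK*PD rewrite described above.</p>
            </div>

```python
# Model one 128-bit lane as a list of four f32 elements.
# Helper names are hypothetical; semantics follow the x86 instruction definitions.

def shufps(a, b, imm):
    """SHUFPS: the low two results select from a, the high two from b (2 bits each)."""
    return [a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]]

def blendps(a, b, imm):
    """BLENDPS: element i comes from b when bit i of imm is set, otherwise from a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def unpcklpd(a, b):
    """UNPCKLPD viewed on f32 elements: low 64 bits of a, then low 64 bits of b."""
    return [a[0], a[1], b[0], b[1]]

def unpckhpd(a, b):
    """UNPCKHPD viewed on f32 elements: high 64 bits of a, then high 64 bits of b."""
    return [a[2], a[3], b[2], b[3]]

# Outputs of the first UNPCKLPS stage for rows a, b and rows c, d (one lane).
t0 = ["a0", "b0", "a1", "b1"]
t2 = ["c0", "d0", "c1", "d1"]

# Assumed blend-based formulation: one shuffle, then two blends.
v = shufps(t0, t2, 0x4E)      # picks t0[2], t0[3], t2[0], t2[1]
lo = blendps(t0, v, 0xCC)     # only the low 4 mask bits matter for one lane
hi = blendps(t2, v, 0x33)

# The combiner's replacement: two unpack nodes, no SHUFPS needed.
assert lo == unpcklpd(t0, t2)
assert hi == unpckhpd(t0, t2)
```

            <div>
              <p class="MsoNormal">Since the two forms are element-for-element identical, a combiner that only minimizes node count will always prefer the two-instruction unpack form over the three-instruction shuffle-plus-blend form.</p>
            </div>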
            <div>
              <p class="MsoNormal"><u></u> <u></u></p>
            </div>
            <div>
              <p class="MsoNormal">If you are referring to this specific
                code, yes same for me.<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal">If you are thinking about the general
                8x8 transpose problem, I have a version with
                vector<4xf32> loads that ends up using blends; as
                expected, the port 5 pressure reduction helps and both
                llvm-mca and runtime agree that this is 20-30% faster.<u></u><u></u></p>
            </div>
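            <div>
              <p class="MsoNormal">To make the port 5 argument concrete, here is a back-of-the-envelope sketch in Python using the instruction counts quoted earlier in the thread. It assumes, as stated in the thread, that on Haswell the unpack, SHUFPS, INSERTF128 and PERM2F128 shuffles can only issue on port 5 while BLENDPS can issue on ports 0, 1 or 5. The function names are hypothetical, and the bound ignores loads, stores, latencies and front-end limits, so it is only a rough estimate and not a substitute for llvm-mca.</p>
            </div>

```python
import math

# Assumed Haswell bindings, per the thread: these shuffles issue only on port 5.
PORT5_ONLY = {
    "UNPCKLPS/UNPCKHPS",
    "UNPCKLPD/UNPCKHPD",
    "SHUFPS",
    "INSERTF128",
    "PERM2F128",
}

def port5_only_uops(seq):
    """Count the uops that have no choice but to issue on port 5."""
    return sum(n for op, n in seq.items() if op in PORT5_ONLY)

def throughput_bound(seq):
    """Rough lower bound in cycles: port 5 is the bottleneck unless the
    total uop count spread evenly over ports 0, 1 and 5 is larger."""
    total = sum(seq.values())
    return max(port5_only_uops(seq), math.ceil(total / 3))

# Instruction mix of the original IR, as listed in the thread.
original = {
    "UNPCKLPS/UNPCKHPS": 8,
    "SHUFPS": 4,
    "BLENDPS": 8,
    "INSERTF128": 4,
    "PERM2F128": 4,
}

# Instruction mix after the x86 shuffle combines.
combined = {
    "UNPCKLPS/UNPCKHPS": 8,
    "UNPCKLPD/UNPCKHPD": 8,
    "INSERTF128": 4,
    "PERM2F128": 4,
}

print("original:", port5_only_uops(original), "port-5-only uops, bound",
      throughput_bound(original), "cycles")
print("combined:", port5_only_uops(combined), "port-5-only uops, bound",
      throughput_bound(combined), "cycles")
```

            <div>
              <p class="MsoNormal">Under these assumptions the combined sequence has fewer instructions overall (24 versus 28) but more port-5-only uops (24 versus 20), so the crude bound gets about 20% worse, which is in the same range as the measured difference.</p>
            </div>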
            <div>
              <p class="MsoNormal"> <u></u><u></u></p>
            </div>
            <blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
              <div>
                <blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
                  <p class="MsoNormal"><br>
                    The only thing I can think of is you might want to
                    see if you can <br>
                    reorder the INSERTF128/PERM2F128 shuffles in between
                    the UNPCK*PS and <br>
                    the SHUFPS/BLENDPS:<br>
                    <br>
                    8 x UNPCKLPS/UNPCKHPS<br>
                    4 x INSERTF128<br>
                    4 x PERM2F128<br>
                    4 x SHUFPS<br>
                    8 x BLENDPS<br>
                    <br>
                    Splitting the per-lane shuffles with the
                    subvector-shuffles could help <br>
                    stop the shuffle combiner.<u></u><u></u></p>
                </blockquote>
              </div>
            </blockquote>
            <div>
              <p class="MsoNormal"><u></u> <u></u></p>
            </div>
            <div>
              <p class="MsoNormal">Right, I tried different variations
                here but invariably got the same result.<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal">The vector<4xf32> based version
                is something that I also want to target for a bunch of
                orthogonal reasons.<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal">I'll note that my use case is MLIR
                codegen with explicit vectors and intrinsics -> LLVM
                so I have quite some flexibility.<u></u><u></u></p>
            </div>
            <p class="MsoNormal">But it feels unnatural in the compiler
              flow to have to branch off at a significantly higher level
              of abstraction to sidestep concerns related to X86
              microarchitecture details.<u></u><u></u></p>
            <div>
              <p class="MsoNormal"><u></u> <u></u></p>
            </div>
            <div>
              <p class="MsoNormal">As I am very new to this part of
                LLVM, I am not sure what is feasible. Would it be
                conceivable to either:<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal">1. have a way to inject some numeric
                cost to influence the value of some resulting
                combinations?<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal">2. revive some form of intrinsic and
                guarantee that the instruction would be generated?<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal"><u></u> <u></u></p>
            </div>
            <div>
              <p class="MsoNormal">I realize point 2. is contrary to the
                evolution of LLVM as these intrinsics were removed ca.
                2015 in favor of the combiner-based approach.<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal">Still it seems that `we have very
                little arch-specific shuffle combines` could be the
                signal that such intrinsics are needed?<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal"> <u></u><u></u></p>
            </div>
            <blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
              <div>
                <blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
                  <p class="MsoNormal"><br>
                    >> I have verified independently that in
                    isolation, a single such <br>
                    >> shuffle creates a vblendps. I see them
                    being recombined in the <br>
                    >> produced assembly and I would like to
                    experiment with preventing <br>
                    >> vshufps + vblendps + vblendps from being
                    recombined into vunpckxxx + <br>
                    >> vunpckxxx instructions.<br>
                    >><br>
                    >> -- <br>
                    _______________________________________________<br>
                    LLVM Developers mailing list<br>
                    <a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
                    <a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><u></u><u></u></p>
                </blockquote>
              </div>
            </blockquote>
          </div>
          <p class="MsoNormal"><br clear="all">
            <u></u><u></u></p>
          <div>
            <p class="MsoNormal"><u></u> <u></u></p>
          </div>
          <p class="MsoNormal">-- <u></u><u></u></p>
          <div>
            <div>
              <p class="MsoNormal">N<u></u><u></u></p>
            </div>
          </div>
        </div>
      </div>
      <br>
    </blockquote>
  </div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr">N</div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">N</div></div>