<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 4 Feb 2019 at 22:04, Simon Moll <<a href="mailto:moll@cs.uni-saarland.de">moll@cs.uni-saarland.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div bgcolor="#FFFFFF">

    <div class="gmail-m_1539775417466410328moz-cite-prefix">On 2/4/19 9:18 PM, Robin Kruppe wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <div dir="ltr"><br>

        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Mon, 4 Feb 2019 at 18:15,

            David Greene via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>>

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Simon Moll <<a href="mailto:moll@cs.uni-saarland.de" target="_blank">moll@cs.uni-saarland.de</a>>

            writes:<br>

            <br>

            > You are referring to the sub-vector sizes, if i am

            understanding<br>

            > correctly. I'd assume that the mask sub-vector length

            always has to be<br>

            > either 1 or the same as the data sub-vector length. For

            example, this<br>

            > is ok:<br>

            ><br>

            > %result = call <scalable 3 x float>

            @llvm.evl.fsub.v4f32(<scalable 3 x<br>

            > float> %x, <scalable 3 x float> %y,

            <scalable 1 x i1> %M, i32 %L)<br>

            <br>

            What does <scalable 1 x i1> applied to <scalable 3

            x float> mean?  I<br>

            would expect a requirement of <scalable 3 x i1>.  At

            least that's how I<br>

            understood the SVE proposal [1].  The n's in <scalable n

            x type> have to<br>

            match.<br>

          </blockquote>

          <div><br>

          </div>

          <div>I believe the idea is to allow each single mask bit to

            control multiple consecutive lanes at once, effectively

            interpreting the vector being operated on as "many short

            fixed-length vectors, concatenated" rather than a single

            long vector of scalars. This is a different interpretation

            of that type than usual, but it's not crazy, e.g. a similar

            reinterpretation of vector types seems to be the favored

            approach for adding matrix operations to LLVM IR. It

            somewhat obscures the point to discuss this only for

            scalable vectors, there's no conceptual reason why one

            couldn't do the same with fixed size vectors.</div>

          <div><br>

          </div>

          <div>In fact, I would recommend against making almost any new

            feature or intrinsic exclusive to scalable vectors,

            including this one: there shouldn't be much extra code

            required to allow and support it, and not doing so makes the

            IR less orthogonal. For example, if a <scalable 4 x

            float> fadd with a <scalable 1 x i1> mask works,

            then <4 x float> fadd with a <1 x i1> mask, a

            <8 x float> fadd with a <2 x i1> mask, etc.

            should also be possible overloads of the same intrinsic.<br>

          </div>

        </div>

      </div>

    </blockquote>
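    <p>The grouped-mask reading described above can be sketched as a small Python model (not LLVM code). The name grouped_masked_fadd and the use of None for masked-off lanes are assumptions made for illustration only; the proposal does not define them.</p>
<pre>
```python
def grouped_masked_fadd(x, y, mask):
    """Lane-wise add where each mask bit guards a group of consecutive lanes."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of lanes")
    if not mask or len(x) % len(mask) != 0:
        raise ValueError("data length must be a multiple of the mask length")
    group = len(x) // len(mask)  # number of lanes controlled by one mask bit
    # Assumption: masked-off lanes are modelled as None, standing in for "undefined".
    return [a + b if mask[i // group] else None
            for i, (a, b) in enumerate(zip(x, y))]


# An 8-lane add with a 2-bit mask: the first bit enables lanes 0..3,
# the second bit disables lanes 4..7.
result = grouped_masked_fadd([1.0] * 8, [2.0] * 8, [True, False])
```
</pre>
    <p>With a mask as long as the data, the same function degenerates to ordinary per-lane predication, which is why the grouped form can be seen as an overload rather than a separate operation.</p>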

    Yep. Doing the same for standard vector IR is on the radar:

    <a class="gmail-m_1539775417466410328moz-txt-link-freetext" href="https://reviews.llvm.org/D57504#1380587" target="_blank">https://reviews.llvm.org/D57504#1380587</a>.<br>

    <blockquote type="cite">

      <div dir="ltr">

        <div class="gmail_quote">

          <div><br>

          </div>

          <div>So far, so good. A bit odd, when I think about it, but if

            hardware out there has that capability, maybe this is a good

            way to encode it in IR (other options might work too,

            though). The crux, however, is the interaction with the

            dynamic vector length: is it in terms of the mask? the

            longer data vector? if the latter, what happens if it isn't

            divisible by the mask length? There are multiple options and

            it's not clear to me which one is "the right one", both for

            architectures with native support (hopefully the one brought

            up here won't be the only one) and for internal consistency

            of the IR. If there was an established architecture with

            this kind of feature where people have gathered lots of

            practical experience with it, we could use that to inform the

            decision (just as we have for ordinary predication and

            dynamic vector length). But I'm not aware of any

            architecture that does this other than the one Jacob and

            lkcl are working on, and as far as I know their project

            is still in the early stages.<br>

          </div>

        </div>

      </div>

    </blockquote>
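    <p>The open question of what the dynamic vector length counts can be illustrated with a minimal Python model. The helper active_lanes and its vl_counts_mask_bits flag are hypothetical names introduced here; neither interpretation is claimed to match what any particular architecture implements.</p>
<pre>
```python
def active_lanes(num_lanes, mask, vl, vl_counts_mask_bits):
    """Return the indices of data lanes that are enabled by both mask and VL."""
    if not mask or num_lanes % len(mask) != 0:
        raise ValueError("data length must be a multiple of the mask length")
    group = num_lanes // len(mask)  # lanes controlled by one mask bit
    # Interpretation 1: VL counts mask bits, so it covers whole groups.
    # Interpretation 2: VL counts individual data lanes.
    limit = vl * group if vl_counts_mask_bits else vl
    return [i for i in range(min(limit, num_lanes)) if mask[i // group]]


# 8 data lanes, 2 mask bits (4 lanes per bit), all mask bits set.
by_mask = active_lanes(8, [True, True], 1, vl_counts_mask_bits=True)
by_lane = active_lanes(8, [True, True], 3, vl_counts_mask_bits=False)
```
</pre>
    <p>Under the first reading, a vector length of 1 enables the whole first group of four lanes. Under the second, a vector length of 3 enables only lanes 0 to 2 and so cuts through a group, which is the non-divisible case raised above.</p>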

    <p>The current understanding is that the dynamic vector length

      operates in the granularity of the mask:

      <a class="gmail-m_1539775417466410328moz-txt-link-freetext" href="https://reviews.llvm.org/D57504#1381211" target="_blank">https://reviews.llvm.org/D57504#1381211</a></p></div></blockquote><div>I do understand that this is what Jacob proposes based on the architecture he works on. However, it is not yet clear to me whether that is the most useful option overall, nor that it is the only option that will lead to reasonable codegen for their architecture. But let's leave discussion of the details on Phab. I just want to highlight one issue that is not specific to Jacob's angle, as it relates to the interpretation of scalable vectors more generally:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">

    <p>In unscaled IR types, this means VL masks each scalar result, in

      scaled types VL masks sub vectors. E.g. for %L == 1 the following

      call produces a pair of floats as the result:<br>

    </p>

    <p><span class="gmail-m_1539775417466410328transaction-comment">

        </span></p><div class="gmail_quote">

          <pre class="gmail-m_1539775417466410328remarkup-code">   <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)</pre></div></div></blockquote><div>As I wrote on Phab mere minutes before you sent this email, I do not think this is the right interpretation for any architecture I know about (I do not know anything about the things Jacob and Luke are working on) nor from the POV of the scalable vector types proposal. A scalable vector is not conventionally "a variable-length vector of fixed-size vectors"; it is simply an ordinary "flat" vector whose length happens to be mostly unknown at compile time. If some intrinsics want to interpret it differently, that is fine, but that's a property of those specific intrinsics -- similar to how proposed matrix intrinsics might interpret a 16 element vector as a 4x4 matrix.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><div class="gmail_quote"><pre class="gmail-m_1539775417466410328remarkup-code"></pre>

          <p><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment">I

                  agree that we should only consider the tied sub-vector

                  case for this first version and keep discussing the

                  unconstrained version. It is seductively easy to allow

                  this but impossible to take it back.</span></span></span></p>

          <p><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment"></span></span></span></p>

          <pre class="gmail-m_1539775417466410328remarkup-code"><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment">---

</span></span></span></pre>

          <p><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment">The

                  story is different when we talk only(!) about memory

                  accesses and having different vector sizes in the

                  operands and the transferred type (result type for

                  loads, value operand type for stores):</span></span></span></p>

          <span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment"></span></span></span>

          <p class="gmail-m_1539775417466410328remarkup-code">E.g., on AVX, this call could turn into

            a 64-bit gather operation of pairs of floats:<br>

          </p>

          <pre><tt>    <16 x float> llvm.evl.gather.v16f32(<8 x float*> %Ptr, <8 x i1> mask %M, i32 vlen 8)</tt></pre></div></div></blockquote><div>Is that IR you'd expect someone to generate (or a backend to consume) for this operation? It seems like a rather unnatural or "magical" way to represent the intent (load 64b each from 8 pointers), at least with the way I'm thinking about it. 

I'd expect a gather of 8xi64 and a bitcast.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><div class="gmail_quote">

        </div>

      <span class="gmail-m_1539775417466410328transaction-comment">

        <div class="gmail_quote"><span class="gmail-m_1539775417466410328transaction-comment">And there

            is a native 16 x 16 element load (VLD2D) on SX-Aurora, which

            may be represented as:<br>

          </span></div>

      </span><span class="gmail-m_1539775417466410328transaction-comment">

        <div class="gmail_quote"><span class="gmail-m_1539775417466410328transaction-comment"><span class="gmail-m_1539775417466410328transaction-comment">

              <pre><tt>    <scalable 256 x double> llvm.evl.gather.nxv16f64(<scalable 16 x double*> %Ptr, <scalable 16 x i1> mask %M, i32 vlen 16)</tt></pre></span></span></div></span></div></blockquote><div>In contrast to the above I can't very well say one should write this as a gather of i1024, but it also seems like a rather specialized instruction (presumably used for blocked processing of matrices?) so I can't say that this on its own motivates me to complicate a proposed core IR construct.<br></div><div><br></div><div>Cheers,</div><div>Robin</div><br></div></div>