[PATCH] D98169: [IR] Permit load/store/alloca for struct with the same scalable vectors.

Sat Mar 13 05:14:59 PST 2021

On Sat, Mar 13, 2021 at 1:43 AM Craig Topper <craig.topper at gmail.com> wrote:

> On Fri, Mar 12, 2021 at 6:51 AM Sander de Smalen via Phabricator <
> reviews at reviews.llvm.org> wrote:
>
>> sdesmalen added a comment.
>>
>> In D98169#2619874 <https://reviews.llvm.org/D98169#2619874>,
>> @craig.topper wrote:
>>
>> > We want this to support the segment load/store intrinsics defined here
>> https://github.com/riscv/rvv-intrinsic-doc/blob/master/intrinsic_funcs/03_vector_load_store_segment_instructions_zvlsseg.md
>> These return 2 to 8 vectors that have been loaded into consecutive
>> registers. I believe SVE has similar instructions. I believe SVE represents
>> these using types wider than their normal scalable vector types and relies
>> on the type legalizer to split them up in the backend. This works for SVE
>> because there is only one known minimum size for all scalable vector types
>> so the type legalizer will always split down to that minimum type.
>>
>> Thanks for providing the context!
>>
>> > For RISC-V vectors we already use 7 different sizes of scalable vectors
>> to represent the ability of our instructions to operate on 2, 4, or 8
>> registers simultaneously. And for 1/2, 1/4, and 1/8 fractional registers.
>> The segment load/store instructions add an extra dimension where they can
>> produce/consume 2, 3, or 4 pairs of registers or 2 quadruples, for
>> example. Following the SVE strategy would give us ambiguous types for the
>> type legalizer.
>>
>> How does that look in terms of IR? Is the number of registers somehow
>> represented in the (LLVM IR) vector type? Or are the types the same, but
>> the compiler generates different code depending on what mode is set? For
>> SVE we know we can split the vector because <vscale x 8 x i32> is twice the
>> size of <vscale x 4 x i32>, regardless of the value for vscale. Indeed we
>> know SVE vectors are a multiple of 128 bits, and therefore that <vscale x 4 x
>> i32> is legal. In order to make any assumptions about
>> splitting/legalization, the compiler will need to know which types are
>> legal, and so I would expect the compiler to know the mode (2, 4, 8) for RVV
>> when generating the code, and therefore have similar knowledge about which
>> types are legal and how the vectors are represented/split into registers.
>> How does that lead to ambiguous types?
>>
>
> The mode can be freely changed at any time by emitting a vsetvli
> instruction. Some instructions like zext/sext can take an input in 1
> register and output in 2. Or input in 2 registers and output in 4. The
> output automatically uses an LMUL and element width twice the input. The
> mode for subsequent instructions would need to be changed to operate on
> this widened data. To represent these different modes we're using 7
> different known minimum sized scalable types from 8 bits up to 512 bits.
> LMUL=1/8 uses <vscale x 1 x i8>, LMUL=1/4 uses <vscale x 2 x i8>, <vscale x
> 1 x i16>, and <vscale x 1 x half>, LMUL=1/2 uses <vscale x 4 x i8>, <vscale x
> 2 x i16>, <vscale x 1 x i32>, <vscale x 2 x half>, and <vscale x 1 x float>.
> LMUL=1 uses <vscale x 8 x i8>, <vscale x 4 x i16>, <vscale x 2 x i32>,
> <vscale x 1 x i64>, <vscale x 4 x half>, <vscale x 2 x float>, <vscale x 1
> x double>, etc. All together there are 22 legal types. For each instruction
> we look at the mode it needs for its input and output types and emit a
> vsetvli instruction immediately before. A later MIR pass goes through and
> removes redundant vsetvli instructions created for adjacent instructions.
>
> The segment load/store instructions operate on groups of these 22 types
> with the caveat that the total size cannot exceed 1/4 of the 32 entry
> register file. So there are no segment load/stores for LMUL=8. For LMUL=4
> you can only use a 2x segment load. If we were to use scalable types to
> represent segment load/store results as well then <vscale x 4 x i32> could
> either be an LMUL2 register or it could be a x2 segment load of 2 <vscale x
> 2 x i32> values or a x4 segment load of 4 <vscale x 1 x i32> values, etc.
> Since <vscale x 4 x i32> is a legal type it would never be split. Within
> segment loads <vscale x 6 x i32> could either be 6 <vscale x 1 x i32>
> values or 3 <vscale x 2 x i32> values.
>
>

Thanks Craig for providing the full context.
For the current RISC-V V intrinsics document, we need to support
load/store/alloca of structs of scalable vectors in IR.
I understand that making TypeSize two-dimensional would be complicated. That
is why I constrain the element types to be either all scalable types or all
fixed-length types. Under that constraint, there is no need to change the
definition of TypeSize. I hope this minimizes the impact on the current IR.
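
To make the constraint concrete, here is a minimal C sketch (not the code
in the patch) of why a struct whose elements are all scalable or all
fixed-length still fits a one-dimensional size. SizeSketch and
struct_size_sketch are hypothetical names used only for illustration, and
padding and alignment are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for TypeSize: a known-minimum size in bits plus a
   flag saying whether that size is multiplied by vscale at run time. */
typedef struct {
    unsigned long min_bits;
    bool scalable;
} SizeSketch;

/* Sums the element sizes into *out. Returns false when scalable and
   fixed-length elements are mixed, because "a * vscale + b" cannot be
   written as a single (min_bits, scalable) pair.
   Assumption: padding and alignment are ignored. */
bool struct_size_sketch(const SizeSketch *elts, size_t n, SizeSketch *out) {
    assert(elts != NULL && out != NULL && n > 0);
    SizeSketch total = {0, elts[0].scalable};
    for (size_t i = 0; i < n; ++i) {
        if (elts[i].scalable != total.scalable)
            return false; /* mixed struct: rejected by the constraint */
        total.min_bits += elts[i].min_bits;
    }
    *out = total;
    return true;
}
```

For a struct of two <vscale x 2 x i32> elements (64 known-minimum bits
each) this gives 128 bits scaled by vscale. Adding a plain i32 element
makes the function return false.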

In https://reviews.llvm.org/D97264, we model the segment load/store types
as IR struct types in Clang. If you need more context about the
requirement, I could upstream more of our downstream implementation of
segment load/store for the RISC-V V extension.
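
The ambiguity Craig describes, which is what motivates the struct
representation, can be sketched in plain C. The code counts how many
meanings a flat <vscale x N x i32> type would have if segment results were
also encoded as wider scalable vectors. The function names are
hypothetical. The field count of 2 to 8 and the 8-register limit come from
this thread; treating a fractional-LMUL field as occupying one whole
register is my assumption.

```c
#include <assert.h>

/* Known-minimum element counts of the i32 scalable types in this thread:
   <vscale x 1 x i32> (LMUL=1/2) up to <vscale x 16 x i32> (LMUL=8). */
static const unsigned legal_i32_min_elts[] = {1, 2, 4, 8, 16};
enum { NUM_LEGAL_I32 = 5 };

/* Registers used by one segment field. For i32, LMUL = min_elts / 2.
   Assumption: a fractional-LMUL field still occupies one whole register. */
static unsigned regs_per_field(unsigned min_elts) {
    return min_elts <= 2 ? 1 : min_elts / 2;
}

/* Counts the readings of a flat <vscale x n x i32>: either a plain legal
   register group, or nf fields (2 <= nf <= 8) of a legal type whose total
   size is at most 8 registers (1/4 of the 32-entry register file). */
unsigned count_i32_interpretations(unsigned n) {
    assert(n > 0);
    unsigned count = 0;
    for (unsigned i = 0; i < NUM_LEGAL_I32; ++i) {
        unsigned m = legal_i32_min_elts[i];
        if (m == n) {
            ++count; /* plain register group, no segments */
            continue;
        }
        if (n % m != 0)
            continue;
        unsigned nf = n / m; /* number of segment fields */
        if (nf >= 2 && nf <= 8 && nf * regs_per_field(m) <= 8)
            ++count;
    }
    return count;
}
```

count_i32_interpretations(4) returns 3 (an LMUL=2 register, 2 fields of
<vscale x 2 x i32>, or 4 fields of <vscale x 1 x i32>), and
count_i32_interpretations(6) returns 2, matching the cases listed above. A
struct type keeps the field count explicit, so each case has exactly one
meaning.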

>
>> > To solve this we would like to use a struct for the segment load/stores
>> to separate them in IR. Since clang needs an address for every variable and
>> needs to be able to load/store them we need to support load/store/alloca.
>>
>> These (C/C++-level) intrinsics are probably implemented using
>> target-specific intrinsics or perhaps a common LLVM IR intrinsic like
>> masked.load, which should be able to take/return a struct with scalable
>> members after D94142 <https://reviews.llvm.org/D94142>. If so, it should
>> be possible to handle this in Clang by emitting `extractvalue` instructions
>> and storing each member individually. That would avoid any changes to LLVM
>> IR. Is that something you've considered?
>>
>
> They're using target specific intrinsics which produce an aggregate after
> D94142. We've been having some internal conversations about doing something
> like this for a masked load of 8 registers.
>
> int data[32] = {0};
> vint8mf8_t a0 = vundefined_i8mf8();
> vint8mf8_t a1 = vundefined_i8mf8();
> vint8mf8_t a2 = vundefined_i8mf8();
> vint8mf8_t a3 = vundefined_i8mf8();
> vint8mf8_t a4 = vundefined_i8mf8();
> vint8mf8_t a5 = vundefined_i8mf8();
> vint8mf8_t a6 = vundefined_i8mf8();
> vint8mf8_t a7 = vundefined_i8mf8();
> vlseg8e8_v_i8mf8x8_m(&a0, &a1, &a2, &a3, &a4, &a5, &a6, &a7, data, 4);
>
> instead of using a x8 struct and vget, vset, vcreate. The main
> disadvantage pointed out so far is that the user could pass null or a
> pointer cast from another type.
>
>
>>
>> If we do need to make this work for scalable vectors, I think it needs a
>> message to the mailing list because it's a change to the LangRef and
>> capabilities of scalable vectors, given previous discussions on this topic.
>> I'd like to avoid giving the impression that we're quietly moving the
>> goalpost on what scalable vectors can do in IR.
>>
>
> Agreed.
>

I agree. Sorry for sending the patch directly without discussing it on the
mailing list first. I will mark the patch as a proof of concept. It
only illustrates what we want to do.

We are still discussing the intrinsic interface internally. If we need this
capability in the end, I will send out an RFC first. We can then discuss
the idea on the mailing list.

Thanks for all your feedback.

>
>>
>>
>> Repository:
>>   rG LLVM Github Monorepo
>>
>> CHANGES SINCE LAST ACTION
>>   https://reviews.llvm.org/D98169/new/
>>
>> https://reviews.llvm.org/D98169
>>
>>