[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses
Demikhovsky, Elena
elena.demikhovsky at intel.com
Wed Mar 25 01:16:56 PDT 2015
> Interleaved access is not a kind of strided access, but it is a subset of indexed access. That's why I think it is reasonable to represent it with an indexed load/store intrinsic.
Ok, let's go back to indexed.
%result = call <8 x i32> @llvm.index.load.v8i32(i32 %ptr, <0, 2, 4, 6, 1, 3, 5, 7>)
When you define the intrinsic, you can specify that the indices are a vector of signed integer compile-time constants. I'm ok to start with this form.
- Elena
-----Original Message-----
From: Hao Liu [mailto:Hao.Liu at arm.com]
Sent: Wednesday, March 25, 2015 03:44
To: Demikhovsky, Elena; 'Arnold Schwaighofer'; renato.golin at linaro.org
Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM; Jiangning Liu; James Molloy; Adam Nemet
Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses
>-----Original Message-----
>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>
>> Then how about to use one indexed load/store?
>> call <8 x i32> @llvm.uindex.load(i32 %ptr, <0, 2, 4, 6, 1, 3, 5, 7>)
>
>But now we want a constant stride, not a vector of indices, right?
>
[Hao Liu]
No. AArch64/ARM supports interleaved access, but supports neither strided access nor masked access.
An interleaved access could be represented by several separate strided accesses. If we identify an interleaved access in the middle end and represent it with separate strided access intrinsics, the AArch64/ARM backend has to identify those separate strided intrinsics and regenerate an interleaved access from them, which is expensive and risky.
Interleaved access is not a kind of strided access, but it is a subset of indexed access. That's why I think it is reasonable to represent it with an indexed load/store intrinsic.
>VectorValue = call <4 x double> @llvm.stride.load.v4f64(BaseAddr, i32 first_ind, i32 stride)
>VectorValue = call <4 x double> @llvm.stride.masked.load.v4f64(BaseAddr, i32 first_ind, i32 stride, Mask, PassThru)
>
>void @llvm.stride.store.v8i32(BaseAddr, i32 first_ind, i32 stride, <8 x i32> VectorValue)
>void @llvm.stride.masked.store.v8i32(BaseAddr, i32 first_ind, i32 stride, <8 x i32> VectorValue, Mask)
>
>The Mask here is a control flow mask for predicated basic blocks:
>
>for (i=0; i< SIZE; i++) {
> if (trigger[i] > 0) { <= the Mask is coming from this "if"!
> A[i*2] +=5;
> B[i*4+1] += 6;
> }
>}
>
>We need masks for X86; not every target has to support them.
>
>- Elena
>
>
>-----Original Message-----
>From: Hao Liu [mailto:Hao.Liu at arm.com]
>Sent: Tuesday, March 24, 2015 07:34
>To: Demikhovsky, Elena; 'Arnold Schwaighofer'
>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>Jiangning Liu; James Molloy; Adam Nemet
>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>interleaved data accesses
>
>Hi Elena,
>
>Oh, I understand your point. Your proposal is exactly the "solution A" in my
>first mail. My patch is implemented like that and also uses similar intrinsics,
>just without masks.
>
>Transforming several intrinsics into one target intrinsic seems difficult and
>risky. We put a lot of effort into matching them in the LoopVectorizer; I don't
>think we want to put a lot of extra effort into matching them again in the
>backend.
>
>Example of combining two intrinsics:
> %odd = <8 x double> @llvm.interleave.load.v8f64(double * %ptr0, i32 2, i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
> %even = <8 x double> @llvm.interleave.load.v8f64(double * %ptr1, i32 2, i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)
>The challenges and risks are:
> (1) %ptr0 and %ptr1 are not guaranteed to be the same. The
>LoopStrengthReduce pass can modify the pointers, so we would need to check
>that %ptr0 and %ptr1 point to the same address.
> (2) There could be other instructions between them, so we would also need
>to do a memory dependence check.
> (3) One intrinsic could be moved to another basic block; then this becomes
>an analysis across basic blocks, while codegen can only cover one basic block.
> (4) As Renato says, even when one load intrinsic is missing, AArch64/ARM
>can still match the rest to ldN, and that is beneficial. But when one store
>intrinsic is missing (e.g. moved to another basic block), matching the
>remaining intrinsics is not beneficial, because AArch64/ARM doesn't have a
>masked store.
>
>On the contrary, "solution B" is to use one intrinsic for the whole
>interleaved access. Matching one-to-one or one-to-N is much easier than
>matching N-to-one.
>I admit the problem with my proposed intrinsics is that we need a lot of
>intrinsics for ld2/ld3/ld4/ld5/..., which seems unreasonable.
>
>Then how about using one indexed load/store? I think your previous proposal
>about indexed load/store intrinsics is interesting. For the odd-even example,
>we can match to an indexed load intrinsic like:
> %result = call <8 x i32> @llvm.uindex.load(i32 %ptr, <0, 2, 4, 6, 1, 3, 5, 7>)
> %even = shufflevector %result, UNDEF, <0, 1, 2, 3>
> %odd = shufflevector %result, UNDEF, <4, 5, 6, 7>
>
>The problem is with loading/storing 3 interleaved vectors. We don't have a
>type like <12 x i32>. One way is to use a masked indexed store like:
> %result = shufflevector <4 x i32> %V0, %V1, <0, 1, 2, 3, 4, 5, 6, 7>
> %result1 = shufflevector <4 x i32> %V2, UNDEF, <0, 1, 2, 3, undef, undef, undef, undef>
> %result2 = shufflevector <8 x i32> %result, %result1, <0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, undef, undef, undef, undef>
> call void @llvm.uindex.masked.store(i32 %ptr, <16 x i32> %result2, <0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11, undef, undef, undef, undef>, <true, ...., false, false, false, false>)
>Another, simpler way is to define new types like <12 x i32> and <6 x i32>, so
>that we can still use unmasked intrinsics as with the 2/4 interleaved
>load/store. I'm not sure whether this is reasonable.
>
>What do you think?
>
>Thanks,
>-Hao
>
>>-----Original Message-----
>>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>>Sent: Monday, March 23, 2015 20:23
>>To: Hao Liu; 'Arnold Schwaighofer'
>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>>Jiangning Liu; James Molloy; Adam Nemet
>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>interleaved data accesses
>>
>>> Actually I think the intrinsics which are currently used in the AArch64/ARM
>>> backends are simpler. Example for 2 interleaved vectors:
>>> %result = call { <2 x double>, <2 x double> } @llvm.aarch64.ld2.v2f64(<2 x double>* ptr)
>>
>>It is simple, but:
>>
>>1) It is not safe, due to a possible memory access past the end of the buffer.
>>2) I don't want to load the odd elements if I need only the even ones; nobody
>>says that it should be implemented by sequential loads with a shuffle.
>>3) What happens if the stride is 3 or 4?
>>4) What happens if the block is predicated?
>>
>>To represent the interleaved load that you want to achieve with the suggested
>>intrinsic, you need 2 calls:
>>%even = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 2, i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
>>%odd = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 2, i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)
>>
>>You can translate these 2 calls into one target-specific instruction in a
>>codegen pass, if the mask is "all true", of course.
>>
>>- Elena
>>
>>
>>-----Original Message-----
>>From: Hao Liu [mailto:Hao.Liu at arm.com]
>>Sent: Monday, March 23, 2015 12:25
>>To: Demikhovsky, Elena; 'Arnold Schwaighofer'
>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>>Jiangning Liu; James Molloy; Adam Nemet
>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>interleaved data accesses
>>
>>Hi Elena,
>>
>>>>-----Original Message-----
>>>>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>>>>Sent: Monday, March 23, 2015 15:45
>>>>To: Hao Liu; 'Arnold Schwaighofer'
>>>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>>>>Jiangning Liu; James Molloy; Adam Nemet
>>>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>>>interleaved data accesses
>>>>
>>>>I agree with Hao that a bunch of loads and shuffles will be very difficult
>>>>to handle. For an interleave factor of 4 and a vector of 8, you'll need 4
>>>>masked loads and 3 shuffles, which will never be gathered together into one
>>>>or two target instructions.
>>>>
>>>>We can also consider an "interleave load" as a special case of
>>>>gather/scatter, but again, recovering the stride and converting back to an
>>>>interleave load would be cumbersome.
>>>>
>>>>I think that we should go with an llvm-common-target intrinsic form until
>>>>CodeGen.
>>>>
>>>>I propose to add a control-flow mask as a parameter to the intrinsic, like
>>>>llvm.masked.load/store, in order to allow efficient vectorization of
>>>>predicated basic blocks:
>>>><8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 %stride, i32 %first_ind, i32 align, <8 x i1> %mask, <8 x double> %PassThru)
>>>>
>>[Hao Liu]
>>I'm curious about how to use this intrinsic to represent an interleaved load.
>>Do you mean the interleaved elements are in the result vector like
>> <8 x double>: A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]?
>>If this is true, then to get the two vectors with odd and even elements, we
>>need two SHUFFLE_VECTORs like:
>> %result = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, ...)
>> // A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
>> %even_elements = shufflevector <8 x double> %result, UNDEF, <4 x i32> <0, 1, 2, 3>
>> %odd_elements = shufflevector <8 x double> %result, UNDEF, <4 x i32> <4, 5, 6, 7>
>> // Operations on %even_elements and %odd_elements.
>>Then how about the interleaved store? It seems we also need shufflevectors
>>to combine into a big vector and call interleave.store.
>>
>>Actually I think the intrinsics which are currently used in the AArch64/ARM
>>backends are simpler. Example for 2 interleaved vectors:
>> %result = call { <2 x double>, <2 x double> } @llvm.aarch64.ld2.v2f64(<2 x double>* ptr)
>> %even_elements = extractvalue { <2 x double>, <2 x double> } %result, 0
>> %odd_elements = extractvalue { <2 x double>, <2 x double> } %result, 1
>>I think extractvalue is simpler than shufflevector.
>>Also the interleaved store is simply one intrinsic like:
>> call void @llvm.aarch64.neon.st2.v2f64(<2 x double>* ptr, <2 x double> %V0, <2 x double> %V1)
>>So I think maybe we can implement similar intrinsics.
>>
>>>>- Elena
>>
>>
>>
>>
>>---------------------------------------------------------------------
>>Intel Israel (74) Limited
>>
>>This e-mail and any attachments may contain confidential material for the sole
>>use of the intended recipient(s). Any review or distribution by others is
>>strictly prohibited. If you are not the intended recipient, please contact the
>>sender and delete all copies.
More information about the llvm-commits mailing list