[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Demikhovsky, Elena elena.demikhovsky at intel.com
Tue Mar 24 03:44:03 PDT 2015


> Then how about to use one indexed load/store?
> call <8 x i32> @llvm.uindex.load(i32 %ptr, <0, 2, 4, 6, 1, 3, 5, 7>)

But now we want a constant stride, not a vector of indices, right?

VectorValue = call <4 x double> @llvm.stride.load.v4f64(BaseAddr, i32 first_ind, i32 stride)
VectorValue = call <4 x double> @llvm.stride.masked.load.v4f64(BaseAddr, i32 first_ind, i32 stride, Mask, PassThru)

void @llvm.stride.store.v8i32(BaseAddr, i32 first_ind, i32 stride, <8 x i32> VectorValue)
void @llvm.stride.masked.store.v8i32(BaseAddr, i32 first_ind, i32 stride, <8 x i32> VectorValue, Mask)

The Mask here is a control flow mask for predicated basic blocks:

for (i=0; i< SIZE; i++) {
    if (trigger[i] > 0) {         <= the Mask is coming from this "if"!
         A[i*2] +=5;
        B[i*4+1] += 6;
    }
}

We need masks for X86; not every target has to support them.
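These masked stride intrinsics do not exist in LLVM today; as a rough model of the semantics I read into the proposal (the function names and reference behavior below are my own sketch, not an existing API), in Python:

```python
def stride_masked_load(mem, base, first_ind, stride, mask, pass_thru):
    # Lane i reads mem[base + first_ind + i*stride] when mask[i] is set;
    # masked-off lanes take the pass-through element instead of touching memory.
    return [mem[base + first_ind + i * stride] if mask[i] else pass_thru[i]
            for i in range(len(mask))]

def stride_masked_store(mem, base, first_ind, stride, value, mask):
    # Masked-off lanes leave memory untouched.
    for i, v in enumerate(value):
        if mask[i]:
            mem[base + first_ind + i * stride] = v

# The "A[i*2] += 5" access from the loop above, under the if-mask:
mem = list(range(16))
mask = [True, False, True, True]      # from trigger[i] > 0
vals = stride_masked_load(mem, 0, 0, 2, mask, [0] * 4)
stride_masked_store(mem, 0, 0, 2, [v + 5 for v in vals], mask)
print(mem[:8])                        # -> [5, 1, 2, 3, 9, 5, 11, 7]
```

Only the even addresses whose mask bit is set are read and written; the masked-off lane (i = 1) leaves A[2] alone.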

-  Elena


-----Original Message-----
From: Hao Liu [mailto:Hao.Liu at arm.com] 
Sent: Tuesday, March 24, 2015 07:34
To: Demikhovsky, Elena; 'Arnold Schwaighofer'
Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM; Jiangning Liu; James Molloy; Adam Nemet
Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Hi Elena,

Oh, I understand your point. Your proposal is exactly the "solution A" in my first mail. My patch is implemented like that and also uses similar intrinsics, just without masks.

Transforming several intrinsics into one target intrinsic seems difficult and risky. We already put a lot of effort into matching them in the LoopVectorizer; I don't think we want to spend a lot of extra effort matching them again in the backend.

Example for combining two intrinsics:
    %odd  = <8 x double> @llvm.interleave.load.v8f64(double * %ptr0, i32 2, i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
    %even = <8 x double> @llvm.interleave.load.v8f64(double * %ptr1, i32 2, i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)

The challenges and risks are:
       (1) %ptr0 and %ptr1 cannot be guaranteed to be the same. The LoopStrengthReduce pass can modify the pointers, so we would need to check that %ptr0 and %ptr1 point to the same address.
       (2) There could be other instructions between them, so we would also need to do a memory dependence check.
       (3) One intrinsic could be moved to another basic block, which turns this into a cross-basic-block analysis. Codegen can only cover one basic block.
       (4) As Renato says, even when one load intrinsic is missing, AArch64/ARM can still match the rest to ldN, and that is beneficial. But when one store intrinsic is missing (e.g. moved to another basic block), matching the remaining intrinsics is not beneficial, because AArch64/ARM has no masked store.

On the contrary, "solution B" is to use one intrinsic for the whole interleaved access. Matching one-to-one or one-to-N is much easier than matching N-to-one.
I admit the problem with my proposed intrinsics is that we would need a lot of intrinsics for ld2/ld3/ld4/ld5/..., which does not seem reasonable.

Then how about using one indexed load/store? I think your previous proposal about indexed load/store intrinsics is interesting. For the odd/even example, we can match to an indexed load intrinsic like:
        %result = call <8 x i32> @llvm.uindex.load(i32 %ptr, <0, 2, 4, 6, 1, 3, 5, 7>)
        %even = shufflevector %result, UNDEF, <0, 1, 2, 3>
        %odd  = shufflevector %result, UNDEF, <4, 5, 6, 7>
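For illustration only (the index-load semantics here are my reading of the proposed @llvm.uindex.load, which is a hypothetical intrinsic, not an existing one), the gather-then-split above behaves like:

```python
def uindex_load(mem, base, indices):
    # Gather mem[base + idx] for each constant index -- a sketch of the
    # proposed @llvm.uindex.load semantics.
    return [mem[base + idx] for idx in indices]

A = [10, 11, 12, 13, 14, 15, 16, 17]
result = uindex_load(A, 0, [0, 2, 4, 6, 1, 3, 5, 7])
even = result[:4]   # first shufflevector: elements from even addresses
odd = result[4:]    # second shufflevector: elements from odd addresses
print(even, odd)    # -> [10, 12, 14, 16] [11, 13, 15, 17]
```

The index vector packs both deinterleaved halves into one result, so the two shufflevectors are just cheap extractions of contiguous halves.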

The problem is loading/storing 3 interleaved vectors: we don't have a type like <12 x i32>. One way is to use a masked indexed load/store like:
           %result  = shufflevector <4 x i32> %V0, %V1, <0, 1, 2, 3, 4, 5, 6, 7>
           %result1 = shufflevector <4 x i32> %V2, UNDEF, <0, 1, 2, 3, undef, undef, undef, undef>
           %result2 = shufflevector <8 x i32> %result, %result1, <0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, undef, undef, undef, undef>
           call void @llvm.uindex.masked.store(i32 %ptr, <16 x i32> %result2, <0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11, undef, undef, undef, undef>, <true, ..., false, false, false, false>)
Another, simpler way is to define new types like <12 x i32> and <6 x i32>, so that we can still use unmasked intrinsics as in the 2/4 interleaved load/store case. I'm not sure whether this is reasonable.
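As a sketch of why the mask makes the <16 x i32> trick work for a factor-3 store (again, @llvm.uindex.masked.store is a hypothetical intrinsic; the reference semantics below are my own assumption), the four false mask lanes simply discard the undef tail:

```python
def uindex_masked_store(mem, base, value, indices, mask):
    # Lane i writes value[i] to mem[base + indices[i]] when mask[i] is set;
    # the false tail covers the four undef lanes of the <16 x i32> value.
    for i, on in enumerate(mask):
        if on:
            mem[base + indices[i]] = value[i]

V0, V1, V2 = [0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]
packed = V0 + V1 + V2 + [None] * 4    # result of the three shufflevectors
indices = [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11, 0, 0, 0, 0]
mask = [True] * 12 + [False] * 4
mem = [None] * 12
uindex_masked_store(mem, 0, packed, indices, mask)
print(mem)   # interleaved: [0, 10, 20, 1, 11, 21, ...]
```

Each of the three source vectors lands at stride 3 in memory, giving the fully interleaved layout without ever needing a native <12 x i32> type.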

What do you think?

Thanks,
-Hao

>-----Original Message-----
>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>Sent: Monday, March 23, 2015 20:23
>To: Hao Liu; 'Arnold Schwaighofer'
>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM; 
>Jiangning Liu; James Molloy; Adam Nemet
>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about 
>interleaved data accesses
>
>> Actually I think the intrinsics which are currently used in the
>> AArch64/ARM backends are simpler. Example for 2 interleaved vectors:
>>        %result = call { <2 x double>, <2 x double> } @llvm.aarch64.ld2.v2f64(<2 x double>* ptr)
>
>It is simple, but
>
>1) It is not safe, due to a possible memory access past the end of the buffer.
>2) I don't want to load the odd elements if I only need the even ones;
>nobody says it should be implemented as sequential loads with a shuffle.
>3) What happens if the stride is 3 or 4?
>4) What happens if the block is predicated?
>
>To represent the interleaved load that you want to achieve with the
>suggested intrinsic, you need 2 calls:
>%even = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 2, i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
>%odd  = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 2, i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)
>
>You can translate these 2 calls into one target-specific call in a codegen
>pass, if the mask is "all true", of course.
>
>-  Elena
>
>
>-----Original Message-----
>From: Hao Liu [mailto:Hao.Liu at arm.com]
>Sent: Monday, March 23, 2015 12:25
>To: Demikhovsky, Elena; 'Arnold Schwaighofer'
>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM; 
>Jiangning Liu; James Molloy; Adam Nemet
>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about 
>interleaved data accesses
>
>Hi Elena,
>
>>>-----Original Message-----
>>>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>>>Sent: Monday, March 23, 2015 15:45
>>>To: Hao Liu; 'Arnold Schwaighofer'
>>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM; 
>>>Jiangning Liu; James Molloy; Adam Nemet
>>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about 
>>>interleaved data accesses
>>>
>>>I agree with Hao that a bunch of loads and shuffles will be very
>>>difficult to handle.
>>>For interleave factor 4 and vector width 8, you'll need 4 masked loads
>>>and 3 shuffles, which will never be gathered together into one or two
>>>target instructions.
>>>
>>>We could also consider an "interleaved load" as a special case of
>>>gather/scatter, but again, recovering the stride and converting back to
>>>an interleaved load would be cumbersome.
>>>
>>>I think that we should go for an llvm-common-target intrinsic form until
>>>CodeGen.
>>>
>>>I propose to add a control-flow mask as a parameter to the intrinsic,
>>>like llvm.masked.load/store, in order to allow efficient vectorization
>>>of predicated basic blocks:
>>><8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 %stride,
>>>i32 %first_ind, i32 align, <8 x i1> %mask, <8 x double> %PassThru)
>>>
>[Hao Liu]
>I'm curious about how to use this intrinsic to represent an interleaved load.
>Do you mean the interleaved elements are in the result vector like
>       <8 x double>: A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
>If this is true, then to get two vectors with the even and odd elements,
>we need two shufflevectors like:
>       %result = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, ...)      // A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
>       %even_elements = shufflevector <8 x double> %result, UNDEF, <4 x i32> <0, 1, 2, 3>
>       %odd_elements = shufflevector <8 x double> %result, UNDEF, <4 x i32> <4, 5, 6, 7>
>       // Operations on %even_elements and %odd_elements.
>Then how about the interleaved store? It seems we would also need
>shufflevectors to combine the parts into a big vector and call
>interleave.store.
>
>Actually I think the intrinsics which are currently used in the
>AArch64/ARM backends are simpler. Example for 2 interleaved vectors:
>        %result = call { <2 x double>, <2 x double> } @llvm.aarch64.ld2.v2f64(<2 x double>* ptr)
>        %even_elements = extractvalue { <2 x double>, <2 x double> } %result, 0
>        %odd_elements = extractvalue { <2 x double>, <2 x double> } %result, 1
>I think extractvalue is simpler than shufflevector.
>Also, the interleaved store is simply one intrinsic like:
>        call void @llvm.aarch64.neon.st2.v2f64(<2 x double>* ptr, <2 x double> %V0, <2 x double> %V1)
>So I think maybe we can implement similar intrinsics.
>
>>>-  Elena
>
>
>
>
>---------------------------------------------------------------------
>Intel Israel (74) Limited
>
>This e-mail and any attachments may contain confidential material for the
>sole use of the intended recipient(s). Any review or distribution by others
>is strictly prohibited. If you are not the intended recipient, please
>contact the sender and delete all copies.
>








