[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Mon Mar 23 22:33:56 PDT 2015

Hi Elena,

Oh, I understand your point. Your proposal is exactly the "solution A" in my
first mail. My patch is implemented like that and also uses similar
intrinsics just without Masks.

Transforming several intrinsics into one target intrinsic seems difficult
and risky. We already put a lot of effort into matching them in the
LoopVectorizer. I don't think we want to spend a lot of extra effort
matching them again in the backend.

Example of combining two intrinsics:
    %even = <8 x double> @llvm.interleave.load.v8f64(double * %ptr0, i32 2,
            i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
    %odd  = <8 x double> @llvm.interleave.load.v8f64(double * %ptr1, i32 2,
            i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)
The challenges and risks are:
       (1) %ptr0 and %ptr1 cannot be guaranteed to be the same. I know the
LoopStrengthReduce pass can modify the pointer. Then we need to check that
%ptr0 and %ptr1 point to the same address.
       (2) There could be other instructions between them, so we also
need to do a memory dependence check.
       (3) One intrinsic could be moved to another basic block, and then it
becomes a cross-basic-block analysis. The codegen can only cover one basic
block.
       (4) As Renato says, even when one intrinsic is missing, AArch64/ARM
can still match the remaining ones to ldN, and that is beneficial. But when
one store intrinsic is missing (e.g. it was moved to another basic block),
AArch64/ARM doesn't have a masked store, so matching the remaining
intrinsics is not beneficial.
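
As a rough illustration of why N-to-one matching is fragile, here is a toy
Python model of the checks listed above. It is only a sketch: the
InterleaveLoad record and can_combine helper are hypothetical names, pointer
identity is modeled as plain name equality (so two names for the same
address fail, as in challenge (1)), and the memory dependence check from
challenge (2) is not modeled at all.

```python
from collections import namedtuple

# Hypothetical record describing one interleave.load call site.
InterleaveLoad = namedtuple("InterleaveLoad", "ptr stride first_index block")

def can_combine(calls):
    """Return True only if the calls could be merged into one ldN-style access."""
    if not calls:
        return False
    first = calls[0]
    # (1) every call must use the same base pointer (name equality only here)
    same_ptr = all(c.ptr == first.ptr for c in calls)
    # (3) every call must live in the same basic block
    same_block = all(c.block == first.block for c in calls)
    same_stride = all(c.stride == first.stride for c in calls)
    # (4) no member of the interleaved group may be missing
    complete = sorted(c.first_index for c in calls) == list(range(first.stride))
    return same_ptr and same_block and same_stride and complete

pair = [InterleaveLoad("%ptr", 2, 0, "bb1"), InterleaveLoad("%ptr", 2, 1, "bb1")]
print(can_combine(pair))
```

Any single failed condition blocks the merge, which is the fragility that a
single intrinsic for the whole group avoids.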

By contrast, "solution B" uses one intrinsic for the whole group of
interleaved accesses. Matching one-to-one or one-to-N is much easier
than matching N-to-one.
I admit the problem with my proposed intrinsics is that we need a lot of
intrinsics for ld2/ld3/ld4/ld5/..., which does not seem reasonable.

Then how about using one indexed load/store? I think your previous proposal
for indexed load/store intrinsics is interesting. For the odd-even
example, we can match to an indexed load intrinsic like:
        %result = call <8 x i32> @llvm.uindex.load(i32 %ptr, <0, 2, 4, 6, 1,
                  3, 5, 7>)
        %even = shufflevector %result, UNDEF, <0, 1, 2, 3>
        %odd = shufflevector %result, UNDEF, <4, 5, 6, 7>
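
To make the lane layout concrete, here is a small Python model of such an
indexed load. It is only a sketch of the semantics assumed in this mail (lane
i reads the element at offset indices[i] from the base); the helper names
interleave_indices and indexed_load are made up for illustration and are not
an existing API.

```python
def interleave_indices(factor, lanes_per_vector):
    """Index vector that groups each interleaved member's lanes together."""
    return [member + factor * lane
            for member in range(factor)
            for lane in range(lanes_per_vector)]

def indexed_load(memory, base, indices):
    """Assumed semantics: lane i reads memory[base + indices[i]]."""
    return [memory[base + i] for i in indices]

memory = [10, 11, 20, 21, 30, 31, 40, 41]      # A[0..7]
idx = interleave_indices(factor=2, lanes_per_vector=4)
result = indexed_load(memory, 0, idx)

# Lanes 0-3 hold the elements at even offsets, lanes 4-7 those at odd offsets.
first_half, second_half = result[:4], result[4:]
print(idx)
print(first_half, second_half)
```

The two slices play the role of the two shufflevector extracts on %result.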

The problem is loading/storing 3 interleaved vectors. We don't have a type
like <12 x i32>. One way is to use the masked indexed load/store like:
           %result = shufflevector <4 x i32> %V0, %V1, <0, 1, 2, 3, 4, 5, 6,
                     7>
           %result1 = shufflevector <4 x i32> %V2, UNDEF, <0, 1, 2, 3, undef,
                      undef, undef, undef>
           %result2 = shufflevector <8 x i32> %result, %result1, <0, 1, 2, 3,
                      4, 5, 6, 7, 8, 9, 10, 11, undef, undef, undef, undef>
           %result = call <16 x i32> @llvm.uindex.masked.store(i32 %ptr, <0,
                     3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11, undef, undef, undef,
                     undef>, <true, ...., false, false, false, false>)

Another, simpler way is to define new types like <12 x i32> and <6 x i32>,
so that we can still use unmasked intrinsics, as with 2/4 interleaved
load/store. I'm not sure whether this is reasonable.
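
For the masked indexed store of 3 interleaved vectors, the index vector and
mask can be derived mechanically. The Python below is a sketch under assumed
semantics (active lane i writes values[i] to offset indices[i]; padding lanes
are masked off); masked_store_operands and masked_indexed_store are
hypothetical helper names, not an existing API.

```python
def masked_store_operands(factor, lanes_per_vector, padded_width):
    """Build the index vector and mask, padded up to a legal vector width."""
    indices = [member + factor * lane
               for member in range(factor)
               for lane in range(lanes_per_vector)]
    pad = padded_width - len(indices)
    if pad < 0:
        raise ValueError("padded_width is smaller than the number of lanes")
    # None stands in for an undef index in the padding lanes.
    return indices + [None] * pad, [True] * len(indices) + [False] * pad

def masked_indexed_store(memory, base, values, indices, mask):
    """Assumed semantics: each active lane writes its value to base + index."""
    for value, index, active in zip(values, indices, mask):
        if active:
            memory[base + index] = value

v0 = ["r0", "r1", "r2", "r3"]
v1 = ["g0", "g1", "g2", "g3"]
v2 = ["b0", "b1", "b2", "b3"]
indices, mask = masked_store_operands(factor=3, lanes_per_vector=4,
                                      padded_width=16)
memory = [None] * 12
masked_indexed_store(memory, 0, v0 + v1 + v2 + [None] * 4, indices, mask)
print(indices[:12])
print(memory)
```

Only the first 12 lanes are active, so the 16-wide store touches exactly the
12 interleaved elements.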

What do you think?

Thanks,
-Hao

>-----Original Message-----
>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>Sent: March 23, 2015 20:23
>To: Hao Liu; 'Arnold Schwaighofer'
>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>Jiangning Liu; James Molloy; Adam Nemet
>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>interleaved data accesses
>
>> Actually I think the intrinsics which are currently used in AArch64/ARM
>> backends are simpler. Example for 2 interleaved vector:
>>        %result = call { <2 x double>, <2 x double> }
>> @llvm.aarch64.ld2.v2f64(<2 x double>* ptr)
>
>It is simple, but
>
>1) It is not safe due to possible memory access after the end of the buffer
>2) I don't want to load odd elements if I need only even - nobody says that
>it should be implemented by sequential loads with shuffle
>3) What happens if stride is 3 or 4?
>4) What happens if the block is predicated?
>
>To represent the interleaved load that you want to achieve with the
>suggested intrinsic, you need 2 calls:
>%even = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 2,
>i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
>%odd  = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 2,
>i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)
>
>You can translate these 2 calls into one target-specific call in the
>codegen pass, if the mask is "all true", of course.
>
>-  Elena
>
>
>-----Original Message-----
>From: Hao Liu [mailto:Hao.Liu at arm.com]
>Sent: Monday, March 23, 2015 12:25
>To: Demikhovsky, Elena; 'Arnold Schwaighofer'
>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>Jiangning Liu; James Molloy; Adam Nemet
>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>interleaved data accesses
>
>Hi Elena,
>
>>>-----Original Message-----
>>>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>>>Sent: March 23, 2015 15:45
>>>To: Hao Liu; 'Arnold Schwaighofer'
>>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>>>Jiangning Liu; James Molloy; Adam Nemet
>>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>>interleaved data accesses
>>>
>>>I agree with Hao that a bunch of loads and shuffles will be very
>>>difficult to handle.
>>>For interleave factor 4 and vector 8, you'll need 4 masked loads and 3
>>>shuffles, which will never be gathered together into one or two target
>>>instructions.
>>>
>>>We also can consider an "interleave load" as a special case of gather /
>>>scatter, but again, getting the stride and converting back to
>>>interleave-load will be cumbersome.
>>>
>>>I think that we should go for llvm-common-target intrinsic form till
>>>the CodeGen.
>>>
>>>I propose to add a mask of control flow as a parameter to the intrinsic,
>>>like llvm.masked.load/store, in order to allow efficient vectorization
>>>of predicated basic blocks.
>>><8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 %stride,
>>>i32 %first_ind, i32 align, <8 x i1> %mask, <8 x double> %PathThru)
>>>
>[Hao Liu]
>I'm curious about how to use this intrinsic to represent interleaved load.
>Do you mean the interleaved elements are in the result vector like
>       <8 x double>: A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
>If this is true, to get two vectors with odd and even elements, we need two
>SHUFFLE_VECTORs like:
>       %result = <8 x double> @llvm.interleave.load.v8f64(double * %ptr,
>...)      // A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
>       %even_elements = shufflevector <8 x double> %result, UNDEF,
>                        <4 x i32> <0, 1, 2, 3>
>       %odd_elements = shufflevector <8 x double> %result, UNDEF,
>                       <4 x i32> <4, 5, 6, 7>
>        // Operations on %even_elements and %odd_elements.
>Then how about the interleaved store? It seems we also need shufflevectors
>to combine into a big vector and call interleave.store.
>
>Actually I think the intrinsics which are currently used in AArch64/ARM
>backends are simpler. Example for 2 interleaved vector:
>        %result = call { <2 x double>, <2 x double> }
>@llvm.aarch64.ld2.v2f64(<2 x double>* ptr)
>        %even_elements = extractvalue { <2 x double>, <2 x double> }
>%result, 0
>        %odd_elements = extractvalue { <2 x double>, <2 x double> }
>%result, 1
>I think extractvalue is simpler than shufflevector.
>Also the interleaved store is simply only one intrinsic like:
>         call void @llvm.aarch64.neon.st2.v2f64(<2 x double>* ptr, <2 x
>double> %V0, <2 x double> %V1)
>So I think maybe we can implement similar intrinsics.
>
>>>-  Elena