[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Demikhovsky, Elena elena.demikhovsky at intel.com
Mon Mar 23 05:23:12 PDT 2015


> Actually I think the intrinsics which are currently used in AArch64/ARM backends are simpler. Example for 2 interleaved vector:
>        %result = call { <2 x double>, <2 x double> } @llvm.aarch64.ld2.v2f64(<2 x double>* ptr)

It is simple, but

1) It is not safe, due to a possible memory access past the end of the buffer
2) I don't want to load the odd elements if I need only the even ones - nobody says it has to be implemented as sequential loads plus a shuffle
3) What happens if the stride is 3 or 4?
4) What happens if the block is predicated?

To represent the interleaved load that you want to achieve with the suggested intrinsic, you need 2 calls:
%even = call <8 x double> @llvm.interleave.load.v8f64(double* %ptr, i32 2, i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
%odd  = call <8 x double> @llvm.interleave.load.v8f64(double* %ptr, i32 2, i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)

You can translate these 2 calls into one target-specific instruction in a codegen pass, if the mask is "all true", of course.
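For readers following the thread, the stride-2 access pattern under discussion looks like this in scalar C (a hypothetical example, not taken from the patch):

```c
#include <assert.h>

/* Hypothetical scalar loop with an interleaved (stride-2) access
 * pattern: one value at the even indices, another at the odd ones.
 * This is the access shape the proposed llvm.interleave.load
 * (stride 2, first index 0 or 1) would let the vectorizer express. */
static void sum_parts(const double *buf, int n_pairs,
                      double *even_sum, double *odd_sum) {
    *even_sum = 0.0;
    *odd_sum = 0.0;
    for (int i = 0; i < n_pairs; ++i) {
        *even_sum += buf[2 * i];     /* even elements: stride 2, offset 0 */
        *odd_sum  += buf[2 * i + 1]; /* odd elements:  stride 2, offset 1 */
    }
}
```

Vectorizing this loop with plain wide loads reads both streams even when only one is needed, which is point 2) above.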

-  Elena


-----Original Message-----
From: Hao Liu [mailto:Hao.Liu at arm.com] 
Sent: Monday, March 23, 2015 12:25
To: Demikhovsky, Elena; 'Arnold Schwaighofer'
Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM; Jiangning Liu; James Molloy; Adam Nemet
Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Hi Elena,

>>-----Original Message-----
>>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>>Sent: March 23, 2015 15:45
>>To: Hao Liu; 'Arnold Schwaighofer'
>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM; 
>>Jiangning Liu; James Molloy; Adam Nemet
>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about 
>>interleaved data accesses
>>
>>I agree with Hao that a bunch of loads and shuffles will be very
>>difficult to handle.
>>For interleave factor 4 and vector width 8, you'll need 4 masked loads
>>and 3 shuffles, which will never be gathered together into one or two
>>target instructions.
>>
>>We can also consider an "interleave load" as a special case of
>>gather/scatter, but again, getting the stride and converting back to an
>>interleave load will be cumbersome.
>>
>>I think that we should keep the llvm-common-target intrinsic form until
>>CodeGen.
>>
>>I propose to add a control-flow mask as a parameter to the intrinsic,
>>like llvm.masked.load/store, in order to allow efficient vectorization
>>of predicated basic blocks.
>><8 x double> @llvm.interleave.load.v8f64(double* %ptr, i32 %stride,
>>i32 %first_ind, i32 align, <8 x i1> %mask, <8 x double> %PassThru)
>>
[Hao Liu]
I'm curious about how to use this intrinsic to represent an interleaved load.
Do you mean the interleaved elements are in the result vector like
       <8 x double>: A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
If this is true, then to get two vectors with the even and odd elements, we need two shufflevectors like:
       %result = call <8 x double> @llvm.interleave.load.v8f64(double* %ptr, ...)  ; A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
       %even_elements = shufflevector <8 x double> %result, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
       %odd_elements = shufflevector <8 x double> %result, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
       ; Operations on %even_elements and %odd_elements.
Then how about the interleaved store? It seems we also need shufflevectors to combine the parts into one big vector before calling interleave.store.
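To make the proposed layout concrete, here is a scalar C sketch of the data movement described above (the wide-vector layout and the two slicing shuffles are modeled with plain arrays; all names are illustrative):

```c
#include <assert.h>
#include <string.h>

/* Scalar model of the proposed single-intrinsic form: a stride-2
 * interleaved load would return one wide vector laid out as all even
 * elements followed by all odd ones; the two shufflevectors then just
 * slice that wide vector in half. */
static void deinterleave_wide(const double *ptr,
                              double even[4], double odd[4]) {
    double wide[8]; /* A[0],A[2],A[4],A[6],A[1],A[3],A[5],A[7] */
    for (int i = 0; i < 4; ++i) {
        wide[i]     = ptr[2 * i];     /* even lanes first */
        wide[i + 4] = ptr[2 * i + 1]; /* odd lanes second */
    }
    memcpy(even, wide,     4 * sizeof(double)); /* mask <0, 1, 2, 3> */
    memcpy(odd,  wide + 4, 4 * sizeof(double)); /* mask <4, 5, 6, 7> */
}
```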

Actually I think the intrinsics currently used in the AArch64/ARM backends are simpler. Example for 2 interleaved vectors:
        %result = call { <2 x double>, <2 x double> } @llvm.aarch64.ld2.v2f64(<2 x double>* %ptr)
        %even_elements = extractvalue { <2 x double>, <2 x double> } %result, 0
        %odd_elements = extractvalue { <2 x double>, <2 x double> } %result, 1
I think extractvalue is simpler than shufflevector.
Also, the interleaved store is simply one intrinsic call:
        call void @llvm.aarch64.neon.st2.v2f64(<2 x double>* %ptr, <2 x double> %V0, <2 x double> %V1)
So I think maybe we can implement similar intrinsics.
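For contrast with the wide-vector form, a scalar C sketch of the ld2-style semantics, where the intrinsic hands back the two deinterleaved vectors directly (modeled here as a struct return; names are illustrative, not the actual backend API):

```c
#include <assert.h>

/* Scalar model of aarch64 ld2-style semantics: the operation returns
 * two already-deinterleaved vectors at once, so the consumer uses
 * extractvalue-like field access instead of shuffles. */
struct ld2_result {
    double v0[2]; /* even elements: ptr[0], ptr[2] */
    double v1[2]; /* odd elements:  ptr[1], ptr[3] */
};

static struct ld2_result ld2_f64(const double *ptr) {
    struct ld2_result r;
    for (int i = 0; i < 2; ++i) {
        r.v0[i] = ptr[2 * i];
        r.v1[i] = ptr[2 * i + 1];
    }
    return r;
}
```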

>>-  Elena








