[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Mon Mar 23 03:25:05 PDT 2015

Hi Elena,

>>-----Original Message-----
>>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>>Sent: March 23, 2015 15:45
>>To: Hao Liu; 'Arnold Schwaighofer'
>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>>Jiangning Liu; James Molloy; Adam Nemet
>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>interleaved data accesses
>>
>>I agree with Hao that a bunch of loads and shuffles will be very difficult
>>to handle.
>>For interleave factor 4 and vector 8, you'll need 4 masked loads and 3
>>shuffles, which will never be gathered together into one or two target
>>instructions.
>>
>>We can also consider an "interleave load" as a special case of gather /
>>scatter, but again, getting the stride and converting back to
>>interleave-load will be cumbersome.
>>
>>I think that we should keep a common, target-independent LLVM intrinsic
>>form until CodeGen.
>>
>>I propose to add a control-flow mask as a parameter to the intrinsic, like
>>llvm.masked.load/store, in order to allow efficient vectorization of
>>predicated basic blocks.
>><8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 %stride,
>>i32 %first_ind, i32 align, <8 x i1> %mask, <8 x double> %PassThru)
>>
[Hao Liu] 
I'm curious how this intrinsic would be used to represent an interleaved load.
Do you mean that the interleaved elements are laid out in the result vector
like this?
       <8 x double>: A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
If so, then to get two vectors holding the even and odd elements, we need two
shufflevectors like:
       ; %result = A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
       %result = call <8 x double> @llvm.interleave.load.v8f64(double* %ptr, ...)
       %even_elements = shufflevector <8 x double> %result, <8 x double> undef,
                        <4 x i32> <i32 0, i32 1, i32 2, i32 3>
       %odd_elements = shufflevector <8 x double> %result, <8 x double> undef,
                       <4 x i32> <i32 4, i32 5, i32 6, i32 7>
       ; Operations on %even_elements and %odd_elements.
Then what about the interleaved store? It seems we would also need
shufflevectors to combine the vectors into one big vector before calling
interleave.store.
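
To make the data movement concrete, here is a minimal scalar sketch in C of
the layout I am assuming above (factor 2, 4 lanes per vector). All function
names are hypothetical and only model the semantics as I understand them;
they are not an existing LLVM API, and the mask, stride and first_ind
parameters are left out.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4, FACTOR = 2 };  /* illustrative sizes matching the example */

/* Hypothetical model of the single-vector interleave load:
   wide = A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]. */
static void interleave_load_model(const double *ptr,
                                  double wide[VF * FACTOR]) {
    assert(ptr != NULL);
    for (size_t f = 0; f < FACTOR; ++f)
        for (size_t i = 0; i < VF; ++i)
            wide[f * VF + i] = ptr[i * FACTOR + f];
}

/* Models the shufflevector that extracts one half (0 = even, 1 = odd). */
static void extract_half(const double wide[VF * FACTOR], size_t half,
                         double out[VF]) {
    for (size_t i = 0; i < VF; ++i)
        out[i] = wide[half * VF + i];
}

/* Models the shufflevector that concatenates two vectors before a store. */
static void concat_halves(const double even[VF], const double odd[VF],
                          double wide[VF * FACTOR]) {
    for (size_t i = 0; i < VF; ++i) {
        wide[i] = even[i];
        wide[VF + i] = odd[i];
    }
}

/* Hypothetical model of the matching interleave store (inverse of the load). */
static void interleave_store_model(double *ptr,
                                   const double wide[VF * FACTOR]) {
    assert(ptr != NULL);
    for (size_t f = 0; f < FACTOR; ++f)
        for (size_t i = 0; i < VF; ++i)
            ptr[i * FACTOR + f] = wide[f * VF + i];
}
```

In this model, every load needs one extract_half step per member and every
store needs a concat_halves step, which corresponds to the extra
shufflevectors I am concerned about.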

Actually, I think the intrinsics currently used in the AArch64/ARM backends
are simpler. An example for 2 interleaved vectors:
        %result = call { <2 x double>, <2 x double> }
                  @llvm.aarch64.neon.ld2.v2f64(<2 x double>* %ptr)
        %even_elements = extractvalue { <2 x double>, <2 x double> } %result, 0
        %odd_elements = extractvalue { <2 x double>, <2 x double> } %result, 1
I think extractvalue is simpler than shufflevector.
Also, the interleaved store is just a single intrinsic call like:
         call void @llvm.aarch64.neon.st2.v2f64(<2 x double>* %ptr,
                   <2 x double> %V0, <2 x double> %V1)
So I think maybe we can implement similar intrinsics.
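
For comparison, the struct-returning form can be modeled with the scalar C
sketch below. The type and function names (vec2f64, vec2f64x2, ld2_model,
st2_model) are hypothetical and only illustrate the semantics I have in mind
for 2 interleaved vectors of 2 doubles; this is not the backend
implementation.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 2 };  /* lanes per vector, as in the <2 x double> example */

typedef struct { double v[LANES]; } vec2f64;    /* models <2 x double> */
typedef struct { vec2f64 val[2]; } vec2f64x2;   /* models { <2 x double>, <2 x double> } */

/* Models an ld2-style load: val[0] gets the even elements and val[1] the
   odd elements, so each member is read with a plain field access
   (the analogue of extractvalue). */
static vec2f64x2 ld2_model(const double *ptr) {
    vec2f64x2 r;
    assert(ptr != NULL);
    for (size_t i = 0; i < LANES; ++i) {
        r.val[0].v[i] = ptr[2 * i];
        r.val[1].v[i] = ptr[2 * i + 1];
    }
    return r;
}

/* Models an st2-style store: takes the two vectors directly and writes them
   back interleaved, with no separate concatenation step. */
static void st2_model(double *ptr, vec2f64 v0, vec2f64 v1) {
    assert(ptr != NULL);
    for (size_t i = 0; i < LANES; ++i) {
        ptr[2 * i] = v0.v[i];
        ptr[2 * i + 1] = v1.v[i];
    }
}
```

Here the even and odd vectors come straight out of the result struct and go
straight into the store, so no shuffle or concatenation step is modeled.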

>>-  Elena