[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Hao Liu Hao.Liu at arm.com
Tue Mar 24 18:43:48 PDT 2015


>-----Original Message-----
>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>
>> Then how about using one indexed load/store?
>> call <8 x i32> @llvm.uindex.load(i32 %ptr, <0, 2, 4, 6, 1, 3, 5, 7>)
>
>But now we want a constant stride, not a vector of indices, right?
>
[Hao Liu] 
No. AArch64/ARM supports interleaved access, but supports neither strided
access nor masked access.

An interleaved access can be represented by several separate strided
accesses. If we identify an interleaved access in the middle end and
represent it with separate strided-access intrinsics, the AArch64/ARM
backend then has to recognize those intrinsics and recombine them into an
interleaved-access intrinsic, which is expensive and risky.

An interleaved access is not a kind of strided access, but it is a subset
of indexed access. That's why I think it is reasonable to represent it
with an indexed load/store intrinsic.
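
To make the relationship concrete, here is a small scalar C sketch of a
factor-2 interleaved access (just an illustration of the access pattern, not
part of the patch). The same contiguous block can be described either as one
indexed load with the constant index pattern <0, 2, 4, 6, 1, 3, 5, 7>, as in
the @llvm.uindex.load example above, or as two separate strided loads with
stride 2 and first indices 0 and 1:

#include <stdio.h>

#define N 4  /* elements per de-interleaved vector */

/* De-interleave a contiguous block of 2*N doubles into an "even" stream and
 * an "odd" stream. This is the scalar equivalent of one indexed load with
 * indices <0, 2, 4, 6, 1, 3, 5, 7>, or of two strided loads with stride 2
 * and first indices 0 and 1. */
static void deinterleave2(const double *ptr, double *even, double *odd) {
  for (int i = 0; i < N; ++i) {
    even[i] = ptr[2 * i];     /* stride 2, first index 0 */
    odd[i]  = ptr[2 * i + 1]; /* stride 2, first index 1 */
  }
}

int main(void) {
  double buf[2 * N] = {0, 1, 2, 3, 4, 5, 6, 7};
  double even[N], odd[N];

  deinterleave2(buf, even, odd);
  for (int i = 0; i < N; ++i)
    printf("even[%d]=%g odd[%d]=%g\n", i, even[i], i, odd[i]);
  return 0;
}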

>VectorValue = call <4 x double> @llvm.stride.load.v4f64(BaseAddr, i32 first_ind, i32 stride)
>VectorValue = call <4 x double> @llvm.stride.masked.load.v4f64(BaseAddr, i32 first_ind, i32 stride, Mask, PassThru)
>
>void @llvm.stride.store.v8i32(BaseAddr, i32 first_ind, i32 stride, <8 x i32> VectorValue)
>void @llvm.stride.masked.store.v8i32(BaseAddr, i32 first_ind, i32 stride, <8 x i32> VectorValue, Mask)
>
>The Mask here is a control flow mask for predicated basic blocks:
>
>for (i=0; i< SIZE; i++) {
>    if (trigger[i] > 0) {         <= the Mask is coming from this "if"!
>         A[i*2] +=5;
>        B[i*4+1] += 6;
>    }
>}
>
>We need masks for X86; not every target has to support them.
>
>-  Elena
>
>
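
To spell out what the mask means here, a scalar C sketch of the predicated
loop quoted above (an illustration only; the function name is made up): the
per-iteration condition trigger[i] > 0 is exactly the per-lane mask bit that
the masked strided-store intrinsics would take.

#include <stddef.h>

/* Scalar model of the predicated loop: lane i is active only when
 * trigger[i] > 0, and only active lanes update the strided elements
 * A[2*i] and B[4*i+1]. A masked strided-store intrinsic would do the same
 * thing for one vector's worth of lanes at a time. */
static void masked_strided_update(int *A, int *B, const int *trigger, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    int active = trigger[i] > 0;  /* the per-lane mask bit */
    if (active) {
      A[2 * i]     += 5;  /* strided store, stride 2, first index 0 */
      B[4 * i + 1] += 6;  /* strided store, stride 4, first index 1 */
    }
  }
}
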
>-----Original Message-----
>From: Hao Liu [mailto:Hao.Liu at arm.com]
>Sent: Tuesday, March 24, 2015 07:34
>To: Demikhovsky, Elena; 'Arnold Schwaighofer'
>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>Jiangning Liu; James Molloy; Adam Nemet
>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>interleaved data accesses
>
>Hi Elena,
>
>Oh, I understand your point. Your proposal is exactly the "solution A" in my
>first mail. My patch is implemented like that and also uses similar
>intrinsics, just without Masks.
>
>Transforming several intrinsics into one target intrinsic seems difficult
>and risky. We already put a lot of effort into matching them in the
>LoopVectorizer; I don't think we want to put a lot of extra effort into
>matching them again in the backend.
>
>Example of combining two intrinsics:
>    %odd  = <8 x double> @llvm.interleave.load.v8f64(double* %ptr0, i32 2, i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
>    %even = <8 x double> @llvm.interleave.load.v8f64(double* %ptr1, i32 2, i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)
>The challenges and risks are:
>       (1) %ptr0 and %ptr1 cannot be guaranteed to be the same. I know the
>LoopStrengthReduce pass can modify the pointers, so we would need to check
>that %ptr0 and %ptr1 point to the same address.
>       (2) There could be other instructions between them, so we also need
>to do a memory dependence check.
>       (3) One intrinsic could be moved to another basic block, and then the
>matching becomes an analysis across basic blocks. Codegen can only cover one
>basic block.
>       (4) As Renato says, even when one load intrinsic is missing,
>AArch64/ARM can still match the rest to ldN and it is beneficial. But when
>one store intrinsic is missing (e.g. moved to another basic block), matching
>the remaining intrinsics is not beneficial, because AArch64/ARM doesn't have
>a masked store.
>
>On the contrary, "solution B" is to use one intrinsic for the whole
>interleaved access. Matching one-to-one or one-to-N is much easier than
>matching N-to-one.
>I admit the problem with my proposed intrinsics is that we need a lot of
>intrinsics for ld2/ld3/ld4/ld5/..., which seems unreasonable.
>
>Then how about using one indexed load/store? I think your previous proposal
>about indexed load/store intrinsics is interesting. For the odd-even example,
>we can match to an indexed load intrinsic like:
>        %result = call <8 x i32> @llvm.uindex.load(i32 %ptr, <0, 2, 4, 6, 1, 3, 5, 7>)
>        %odd  = shufflevector %result, UNDEF, <0, 1, 2, 3>
>        %even = shufflevector %result, UNDEF, <4, 5, 6, 7>
>
>The problem is how to load/store 3 interleaved vectors. We don't have a type
>like <12 x i32>. One way is to use the masked indexed load/store like:
>           %result  = shufflevector <4 x i32> %V0, %V1, <0, 1, 2, 3, 4, 5, 6, 7>
>           %result1 = shufflevector <4 x i32> %V2, UNDEF, <0, 1, 2, 3, undef, undef, undef, undef>
>           %result2 = shufflevector <4 x i32> %result, %result1, <0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, undef, undef, undef, undef>
>           %result3 = call <16 x i32> @llvm.uindex.masked.store(i32 %ptr, <0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11, undef, undef, undef, undef>, <true, ...., false, false, false, false>)
>Another simpler way is to define new types like <12 x i32> and <6 x i32>, so
>that we can still use unmasked intrinsics as in the 2/4 interleaved
>load/store case. I'm not sure whether this is reasonable.
>
>What do you think?
>
>Thanks,
>-Hao
>
>>-----Original Message-----
>>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>>Sent: Monday, March 23, 2015 20:23
>>To: Hao Liu; 'Arnold Schwaighofer'
>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>>Jiangning Liu; James Molloy; Adam Nemet
>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>interleaved data accesses
>>
>>> Actually I think the intrinsics which are currently used in the AArch64/ARM
>>> backends are simpler. Example for 2 interleaved vectors:
>>>        %result = call { <2 x double>, <2 x double> } @llvm.aarch64.ld2.v2f64(<2 x double>* ptr)
>>
>>It is simple, but
>>
>>1) It is not safe due to possible memory access past the end of the buffer
>>2) I don't want to load odd elements if I need only even - nobody says that
>>it should be implemented by sequential loads with a shuffle
>>3) What happens if stride is 3 or 4?
>>4) What happens if the block is predicated?
>>
>>To represent the interleaved load that you want to achieve with the
>>suggested intrinsic, you need 2 calls:
>>%even = <8 x double> @llvm.interleave.load.v8f64(double* %ptr, i32 2, i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
>>%odd  = <8 x double> @llvm.interleave.load.v8f64(double* %ptr, i32 2, i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)
>>
>>You can translate these 2 calls into one target-specific call in a codegen
>>pass, if the mask is "all true", of course.
>>
>>-  Elena
>>
>>
>>-----Original Message-----
>>From: Hao Liu [mailto:Hao.Liu at arm.com]
>>Sent: Monday, March 23, 2015 12:25
>>To: Demikhovsky, Elena; 'Arnold Schwaighofer'
>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>>Jiangning Liu; James Molloy; Adam Nemet
>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>interleaved data accesses
>>
>>Hi Elena,
>>
>>>>-----Original Message-----
>>>>From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
>>>>Sent: Monday, March 23, 2015 15:45
>>>>To: Hao Liu; 'Arnold Schwaighofer'
>>>>Cc: Hal Finkel; Nadav Rotem; Commit Messages and Patches for LLVM;
>>>>Jiangning Liu; James Molloy; Adam Nemet
>>>>Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>>>interleaved data accesses
>>>>
>>>>I agree with Hao that a bunch of loads and shuffles will be very
>>>>difficult to handle.
>>>>For interleave factor 4 and vector width 8, you'll need 4 masked loads
>>>>and 3 shuffles, which will never be gathered together into one or two
>>>>target instructions.
>>>>
>>>>We can also consider an "interleave load" as a special case of
>>>>gather/scatter, but again, getting the stride and converting back to an
>>>>interleave load will be cumbersome.
>>>>
>>>>I think that we should go with a common target-independent LLVM intrinsic
>>>>form until CodeGen.
>>>>
>>>>I propose to add a control-flow mask as a parameter to the intrinsic,
>>>>like llvm.masked.load/store, in order to allow efficient vectorization
>>>>of predicated basic blocks.
>>>><8 x double> @llvm.interleave.load.v8f64(double* %ptr, i32 %stride, i32 %first_ind, i32 align, <8 x i1> %mask, <8 x double> %PassThru)
>>>>
>>[Hao Liu]
>>I'm curious about how to use this intrinsic to represent an interleaved
>>load. Do you mean the interleaved elements are in the result vector like
>>       <8 x double>: A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
>>If this is true, to get two vectors with odd and even elements, we need two
>>SHUFFLE_VECTORs like:
>>       %result = <8 x double> @llvm.interleave.load.v8f64(double* %ptr, ...)      // A[0], A[2], A[4], A[6], A[1], A[3], A[5], A[7]
>>       %even_elements = shufflevector <8 x double> %result, UNDEF, <4 x i32> <0, 1, 2, 3>
>>       %odd_elements  = shufflevector <8 x double> %result, UNDEF, <4 x i32> <4, 5, 6, 7>
>>       // Operations on %even_elements and %odd_elements.
>>Then how about the interleaved store? It seems we also need shufflevectors
>>to combine the vectors into a big vector and then call interleave.store.
>>
>>Actually I think the intrinsics which are currently used in the AArch64/ARM
>>backends are simpler. Example for 2 interleaved vectors:
>>        %result = call { <2 x double>, <2 x double> } @llvm.aarch64.ld2.v2f64(<2 x double>* ptr)
>>        %even_elements = extractvalue { <2 x double>, <2 x double> } %result, 0
>>        %odd_elements  = extractvalue { <2 x double>, <2 x double> } %result, 1
>>I think extractvalue is simpler than shufflevector.
>>Also the interleaved store is simply one intrinsic like:
>>         call void @llvm.aarch64.neon.st2.v2f64(<2 x double>* ptr, <2 x double> %V0, <2 x double> %V1)
>>So I think maybe we can implement similar intrinsics.
>>
>>>>-  Elena
>>
>>
>>
>>







