[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Demikhovsky, Elena elena.demikhovsky at intel.com
Mon Mar 23 00:45:27 PDT 2015


I agree with Hao that a bunch of loads and shuffles will be very difficult to handle.
For an interleave factor of 4 and a vector width of 8, you'll need 4 masked loads and 3 shuffles, which will never be gathered together into one or two target instructions.

We could also consider an "interleave load" as a special case of gather / scatter, but again, recovering the stride and converting back to an interleave load would be cumbersome.

I think that we should use a target-independent LLVM intrinsic form until CodeGen.

I propose adding a control-flow mask as a parameter to the intrinsic, like llvm.masked.load/store, in order to allow efficient vectorization of predicated basic blocks:
<8 x double> @llvm.interleave.load.v8f64(double* %ptr, i32 %stride, i32 %first_ind, i32 %align, <8 x i1> %mask, <8 x double> %passthru)
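For illustration, the semantics of the proposed intrinsic could be sketched as a scalar model (this is only an interpretation of the signature above; the function name and the exact masking/pass-through behavior are assumptions modeled on llvm.masked.load, not part of the proposal):

```python
def interleave_load(mem, first_ind, stride, mask, passthru):
    """Scalar model of the proposed intrinsic: lane i reads
    mem[first_ind + i*stride] where mask[i] is set, and takes the
    corresponding passthru element where it is not."""
    return [mem[first_ind + i * stride] if mask[i] else passthru[i]
            for i in range(len(mask))]

# Gather every 4th element starting at index 1, with lane 2 masked off.
mem = list(range(32))
result = interleave_load(mem, 1, 4, [True, True, False, True], [0.0] * 4)
# result: [1, 5, 0.0, 13]
```

The pass-through operand keeps masked-off lanes well defined, which is what makes the intrinsic usable inside predicated basic blocks.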

-  Elena


-----Original Message-----
From: Hao Liu [mailto:Hao.Liu at arm.com] 
Sent: Monday, March 23, 2015 05:11
To: 'Arnold Schwaighofer'
Cc: Hal Finkel; Nadav Rotem; Demikhovsky, Elena; Commit Messages and Patches for LLVM; Jiangning Liu; James Molloy; Adam Nemet
Subject: RE: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Hi Arnold,

See my comments below.

>>-----Original Message-----
>>From: Arnold Schwaighofer [mailto:aschwaighofer at apple.com]
>>Sent: Friday, March 20, 2015 23:57
>>To: Hao Liu
>>Cc: Hal Finkel; Nadav Rotem; Elena Demikhovsky; Commit Messages and 
>>Patches for LLVM; Jiangning Liu; James Molloy; Adam Nemet
>>Subject: Re: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about 
>>interleaved data accesses
>>
>>Hi Hao,
>>
>>thanks for working on this. I would love to see this happening. 
>>Previous discussion on this happened on bug 
>>https://llvm.org/bugs/show_bug.cgi?id=17677.
>>
>>I have not yet looked at your implementation in detail but your
>>description sounds similar to what is described in this bug.
>>
>>As you have observed in 2., you also need to teach dependency checking
>>about strided accesses.
>>
>>I would like to discuss the trade-off intrinsic vs emitting a vector
>>load (store) and the necessary shuffling: you have not mentioned this
>>third possibility.
>>
>>Would it be hard to match such a combination in the backend?
>>
>>In your example we would have
>>
>>v1 = vload <4xi32>
>>v2 = vload <4xi32>
>>even_odd_vec = shuffle_vector v1, v2, <0,2,4,6, 1,3,5,7>
>>even = shuffle_vector even_odd_vec, undef, <0,1,2,3>
>>odd  = shuffle_vector even_odd_vec, undef, <4,5,6,7>
>>
>>I suspect that LLVM would take this apart to:
>>
>>v1 = vload <4xi32>
>>v2 = vload <4xi32>
>>even = shuffle_vector v1, v2, <0,2,4,6>
>>odd  = shuffle_vector v1, v2, <1,3,5,7>
>>
[Hao Liu]
Yeah, you are right when the interleave factor is 2. We can use 4 IR instructions to represent interleaved accesses of 2 vectors (even though that is 2 more instructions than the representation with 2 intrinsics, it is acceptable).
But the IR representation becomes unacceptable when the interleave factor is 3 or 4. E.g. if we want 3 interleaved vectors:
      V0: A[0], A[3], A[6], A[9]
      V1: A[1], A[4], A[7], A[10]
      V2: A[2], A[5], A[8], A[11]
We use 3 loads and 4 shuffles:
      %v0 = vload <4xi32>               // A[0...3]
      %v1 = vload <4xi32>               // A[4...7]
      %v2 = vload <4xi32>               // A[8...11]
      %tmp0 = shuffle_vector %v0, %v1, <0,3,6,undef>   // A[0], A[3], A[6], UNDEF
      %res0 = shuffle_vector %tmp0, %v2, <0,1,2,5>     // A[0], A[3], A[6], A[9]
      %tmp1 = shuffle_vector %v0, %v1, <1,4,7,undef>   // A[1], A[4], A[7], UNDEF
      %res1 = shuffle_vector %tmp1, %v2, <0,1,2,6>     // A[1], A[4], A[7], A[10]
      ...
That is too many IR instructions. Matching so many instructions in the backend is too difficult and fragile: if one of them is moved or deleted, the others may fail to be matched to ldN/stN.
For more details about ldN/stN, see:
http://community.arm.com/groups/processors/blog/2010/03/17/coding-for-neon--part-1-load-and-stores
ldN/stN may be useful for image processing or signal processing.
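As a cross-check, the shuffle sequence above can be simulated with a scalar model of shufflevector (indices select from the concatenation of the two input vectors; undef lanes are modeled here as None):

```python
def shuffle_vector(a, b, mask):
    """Model of LLVM's shufflevector: each mask index selects from the
    concatenation of a and b; None stands in for an undef lane."""
    concat = a + b
    return [concat[i] if i is not None else None for i in mask]

A = [f"A[{i}]" for i in range(12)]
v0, v1, v2 = A[0:4], A[4:8], A[8:12]           # three consecutive wide loads

tmp0 = shuffle_vector(v0, v1, [0, 3, 6, None])  # A[0], A[3], A[6], undef
res0 = shuffle_vector(tmp0, v2, [0, 1, 2, 5])   # A[0], A[3], A[6], A[9]
tmp1 = shuffle_vector(v0, v1, [1, 4, 7, None])  # A[1], A[4], A[7], undef
res1 = shuffle_vector(tmp1, v2, [0, 1, 2, 6])   # A[1], A[4], A[7], A[10]
```

The model confirms the sequence is correct, but it also makes the matching problem visible: every deinterleaved result depends on a chain of shuffles over the same loads, and the backend must recognize the whole chain at once.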

>>
>>Which is not what you want because then you have to match two
>>instructions
>>
>>even = shuffle_vector v1, v2, <0,2,4,6>
>>odd  = shuffle_vector v1, v2, <1,3,5,7>
>>
>>which could fail.
[Hao Liu]
Yes, that's my point. Combining several IR instructions/intrinsics is fragile,
especially since the LoopVectorizer can unroll the vectorized loop and leave the instructions/intrinsics interleaved.
E.g. Combining
      interleave.store(ptr, v0, 0)                     // V0
      interleave.store(unrollPtr, v2, 0)            // V2
      interleave.store(ptr, v1, 1)                     // V1
      interleave.store(unrollPtr, v3, 1)            // V3
to
      aarch64.st2(ptr, v0, v1)
      aarch64.st2(unrollPtr, v2, v3)

The fragile cases are:
(1) One of the intrinsics may be moved to another basic block; how do we combine them across basic blocks? One of the intrinsics/instructions may be deleted; how do we combine the ones that are left?
(2) If we want to combine v0 with v1, and v2 with v3, we also need to check the memory dependences between v0 and v2, v2 and v1, ...
(3) The pointers used in the intrinsics may be changed (AFAIK, LoopStrengthReduce can change the pointer representations, which is why I put the backend combine pass before the LSR pass), so we may also need to check that different pointers point to the same address.
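For reference, the st2-style store that the combine is trying to form writes two registers element-wise interleaved into memory. A scalar model (illustrative only; the real aarch64.st2 intrinsic takes vector registers and a pointer):

```python
def st2(mem, base, v0, v1):
    """Scalar model of an AArch64 st2-style interleaved store:
    elements of v0 and v1 are written alternately starting at base."""
    for i, (a, b) in enumerate(zip(v0, v1)):
        mem[base + 2 * i] = a
        mem[base + 2 * i + 1] = b

mem = [None] * 8
st2(mem, 0, ["v0_0", "v0_1", "v0_2", "v0_3"],
            ["v1_0", "v1_1", "v1_2", "v1_3"])
# mem is now v0 and v1 interleaved: v0_0, v1_0, v0_1, v1_1, ...
```

This is why the combine must pair exactly the right two interleave.store intrinsics: st2 commits both registers' elements in one interleaved layout, so pairing v0 with v3 instead of v1 would silently write the wrong pattern.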
>>
>>To get around this you would really need an intrinsic that is not
>>taken apart, like your solution B. Solution A has the same problem as
>>just using shuffles (you are dependent on the pattern match
>>succeeding, which might be foiled by intermediate transformations) so
>>I don’t think it would be a good option.
>>
>>{v1_ty_1, ... ,v1_ty_n} = interleave_load_n v1, .., vN (and
>>interleaved_store)
>>
>>I don’t understand why Solution B is not target independent. You just
>>need to lower this intrinsic to regular vector loads and shuffles on
>>targets that don’t support it natively.
[Hao Liu]
I'm happy to see that you also think one intrinsic is better than separate intrinsics/IR instructions. I'll work out a new patch along these lines.

>>
>>Aside: I saw in your patch that you are asking whether a target
>>supports it, i.e.
>>"isLegalInterleaveAccess(Type *DataTy)". This should not matter;
>>rather, the cost model should return appropriate costs for
>>“interleaved_stores/loads”. Even if a target does not support such an
>>instruction natively it might still be beneficial to vectorize with
>>regular loads and shuffles.
[Hao Liu]
Oh, that is true if the interleave factor is 2. Two vector loads and two shuffles may still be beneficial and can enable the loop vectorization.

>>
>>
>>Thanks,
>>Arnold

Thanks,
-Hao


