[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Arnold aschwaighofer at apple.com
Fri Mar 20 11:44:05 PDT 2015



Sent from my iPhone

> On Mar 20, 2015, at 10:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> ----- Original Message -----
>> From: "Arnold Schwaighofer" <aschwaighofer at apple.com>
>> To: "Hao Liu" <Hao.Liu at arm.com>
>> Cc: "Hal Finkel" <hfinkel at anl.gov>, "Nadav Rotem" <nrotem at apple.com>, "Elena Demikhovsky"
>> <elena.demikhovsky at intel.com>, "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>, "Jiangning Liu"
>> <Jiangning.Liu at arm.com>, "James Molloy" <James.Molloy at arm.com>, "Adam Nemet" <anemet at apple.com>
>> Sent: Friday, March 20, 2015 10:56:55 AM
>> Subject: Re: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses
>> 
>> Hi Hao,
>> 
>> Thanks for working on this; I would love to see this happen.
>> Previous discussion on this happened on bug
>> https://llvm.org/bugs/show_bug.cgi?id=17677.
>> 
>> I have not yet looked at your implementation in detail but your
>> description sounds similar to what is described in this bug.
>> 
>> As you have observed in 2., you also need to teach dependency checking
>> about strided accesses.
>> 
>> I would like to discuss the trade-off between an intrinsic and
>> emitting a vector load (store) plus the necessary shuffling: you have
>> not mentioned this third possibility.
>> 
>> Would it be hard to match such a combination in the backend?
>> 
>> In your example we would have
>> 
>> v1 = vload <4xi32>
>> v2 = vload <4xi32>
>> even_odd_vec = shuffle_vector v1, v2, <0,2,4,6, 1,3,5,7>
>> even = shuffle_vector even_odd_vec, undef, <0,1,2,3>
>> odd =  shuffle_vector even_odd_vec, undef, <4,5,6,7>
>> 
>> I suspect that LLVM would take this apart to:
>> 
>> v1 = vload <4xi32>
>> v2 = vload <4xi32>
>> even = shuffle_vector v1, v2, <0,2,4,6>
>> odd =  shuffle_vector v1, v2, <1,3,5,7>
>> 
>> 
>> Which is not what you want, because then you have to match two
>> instructions:
>> 
>> even = shuffle_vector v1, v2, <0,2,4,6>
>> odd =  shuffle_vector v1, v2, <1,3,5,7>
>> 
>> which could fail.
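For reference, here is that canonicalized loads-plus-shuffles form written out as a
minimal LLVM IR sketch (names, types, and alignment are illustrative only, assuming
<4 x i32> vectors and a pointer %p to the first of the eight i32 elements the scalar
loop already touches):

  %p1 = bitcast i32* %p to <4 x i32>*
  %v1 = load <4 x i32>, <4 x i32>* %p1, align 4          ; A[0..3]
  %p2 = getelementptr <4 x i32>, <4 x i32>* %p1, i64 1
  %v2 = load <4 x i32>, <4 x i32>* %p2, align 4          ; A[4..7]
  %even = shufflevector <4 x i32> %v1, <4 x i32> %v2, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %odd  = shufflevector <4 x i32> %v1, <4 x i32> %v2, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

This is exactly the two-shuffle pattern the backend would have to re-match.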
>> 
>> To get around this you would really need an intrinsic that is not
>> taken apart, like your solution B. Solution A has the same problem
>> as just using shuffles (you are dependent on the pattern match
>> succeeding, which might be foiled by intermediate transformations), so
>> I don’t think it would be a good option.
>> 
>> {v1_ty_1, ... ,v1_ty_n} = interleave_load_n v1, .., vN
>> (and interleaved_store)
>> 
>> I don’t understand why Solution B is not target independent. You just
>> need to lower this intrinsic to regular vector loads and shuffles on
>> targets that don’t support it natively.
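A minimal sketch of what such a generic lowering could produce (purely illustrative;
the wide-load layout assumes the whole horizontal range may be touched, which is
exactly the caveat discussed below): the load side expands into one wide load plus
de-interleaving shuffles, and the store side goes the other way round, one
interleaving shuffle plus a wide store:

  ; load side: one wide load, then de-interleave into even/odd lanes
  %wide = load <8 x i32>, <8 x i32>* %wide.ptr, align 4
  %even = shufflevector <8 x i32> %wide, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %odd  = shufflevector <8 x i32> %wide, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

  ; store side: interleave the two halves, then one wide store
  %mix = shufflevector <4 x i32> %even, <4 x i32> %odd, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  store <8 x i32> %mix, <8 x i32>* %wide.ptr, align 4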
>> 
>> Aside: I saw in your patch that you are asking whether a target
>> supports it, i.e. "isLegalInterleaveAccess(Type *DataTy)". This should
>> not matter; rather, the cost model should return appropriate costs
>> for “interleaved_stores/loads”. Even if a target does not support
>> such an instruction natively, it might still be beneficial to
>> vectorize with regular loads and shuffles.
> 
> I think that it really *almost* does not matter. If we want to access A[0], A[2], A[4], A[6] using regular vector loads and shuffles, we can load A[0...3] and A[4...7], and then shuffle the results. But we need to ensure that we don't access past the end of the range of the original access pattern.

I agree that we have to make sure that we only access memory that the original code would have accessed.

We know the loop iterates a vector length's worth of times, and we know we access 2*i and 2*i+1.

The preceding analysis would have to make sure that the whole horizontal access is indeed accessed, i.e. both 2*i and 2*i+1.

I was assuming this. You aren't, and that is a good point.

You are right: if we don't want to assume whole horizontal access, we have to use predication or an intrinsic per access.

Exiting early: I don't think we can assume that, just because there are accesses at a[i] and a[i+n], any access in between would not trap, which leaves us with predication or proving whole horizontal access (which would preclude a loop that only accesses a[2*i]).
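A minimal sketch of the predicated variant for the a[2*i]-only case, assuming the
masked-load intrinsic that was recently added (the exact name mangling differs
between LLVM releases, so treat the signature below as illustrative): only the even
lanes are live in the mask, so lanes the scalar loop never touches are not
dereferenced:

  ; mask = <1,0,1,0,1,0,1,0>: load only the even lanes, undef passthru elsewhere
  %even.wide = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %wide.ptr, i32 4, <8 x i1> <i1 true, i1 false, i1 true, i1 false, i1 true, i1 false, i1 true, i1 false>, <8 x i32> undef)
  ; compact the live lanes into a <4 x i32>
  %even = shufflevector <8 x i32> %even.wide, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>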

> So we have two options:
> 
> 1. Leave the vector loop one iteration early
> 2. Use predicated (masked) loads/stores
> 
> If we have neither predicated loads/stores nor interleaved loads/stores, then (1) is the best option. If we have either of these things, then we don't want to leave the vector loop early (in some cases, this will be the difference between having a scalar tail loop and not having one). So I think that the vectorizer does want to know, so that it can decide what to do with the loop bounds and the tail loop.
> 
> For general pattern matching, however, because of the access range issue, we probably can't choose a canonical form involving regular accesses and shuffles. We could use the predicated/masked load/store intrinsics plus shuffles.
> 
> -Hal
> 
>> 
>> Thanks,
>> Arnold
>> 
>> 
>>> On Mar 20, 2015, at 4:47 AM, Hao Liu <Hao.Liu at arm.com> wrote:
>>> 
>>> Hi,
>>> 
>>> There are two patches attached that achieve this goal:
>>>   LoopVectorize-InterleaveAccess.patch teaches the Loop Vectorizer
>>>   about interleaved data accesses and generates a target-independent
>>>   intrinsic for each load/store;
>>>   AArch64Backend-MatchIntrinsics.patch matches several target-independent
>>>   intrinsics into one AArch64 ldN/stN intrinsic,
>>>   so that the AArch64 backend can generate ldN/stN instructions.
>>> 
>>> Currently, LoopVectorize can vectorize consecutive accesses well.
>>> It can vectorize loops like
>>>  for (int i = 0; i < n; i++)
>>>       sum += R[i];
>>> 
>>> But it doesn't handle strided access well. Interleaved access is a
>>> subset of strided access. Example for interleaved access:
>>>  for (int i = 0; i < n; i++) {
>>>       int even = A[2*i];
>>>       int odd = A[2*i + 1];
>>>       // do something with odd & even.
>>>  }
>>> To vectorize such a case, we need two vectors: one with the even
>>> elements, another with the odd elements. To gather the even elements, we
>>> need several scalar loads for "A[0], A[2], A[4], ...", and several
>>> INSERT_ELEMENTs to combine them together. The cost is very high
>>> and will usually prevent loop vectorization of such a case. Some
>>> backends like AArch64/ARM support interleaved loads/stores: ldN/stN
>>> (N is 2/3/4), and I know X86 can also support similar operations.
>>> One ld2 can load two vectors: one with only the even elements, the other
>>> with only the odd elements, so this case can be vectorized
>>> into AArch64 instructions:
>>>   LD2 { V0, V1 }, [X0]
>>>   // V0 contains the even elements; do something related to the even elements with V0.
>>>   // V1 contains the odd elements; do something related to the odd elements with V1.
>>> 
>>> 
>>> 1. Design
>>> My design follows the current Loop Vectorizer's three phases.
>>> (1) Legality Phase:
>>>  (a). Collect all the constant strided accesses except
>>>  consecutive accesses.
>>>  (b). Collect the load/store accesses with the same Stride and Base
>>>  pointer.
>>>  (c). Find the consecutive chains in (b). If the number of
>>>  accesses in one chain is equal to the Stride, they are
>>>  interleaved accesses.
>>> Take the even/odd case above as an example: we find two loads, for
>>> the even and the odd elements. The strides are both 2, and they are
>>> also consecutive, so they are recorded as interleaved accesses.
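For illustration only (not taken from the patch), this is roughly the scalar IR shape
the legality phase would be grouping for the even/odd loop, assuming %i is the
induction variable and %A the base pointer: same base, both strides are 2, and the
two offsets 2*i and 2*i+1 form a chain whose length equals the stride:

  %idx.even = shl nsw i64 %i, 1                                   ; 2*i
  %idx.odd  = or i64 %idx.even, 1                                 ; 2*i + 1
  %p.even   = getelementptr inbounds i32, i32* %A, i64 %idx.even
  %p.odd    = getelementptr inbounds i32, i32* %A, i64 %idx.odd
  %even     = load i32, i32* %p.even, align 4
  %odd      = load i32, i32* %p.odd, align 4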
>>> 
>>> (2) Profitability Phase:
>>>   Add a target hook to calculate the cost. Currently the cost is
>>>   1; for now this doesn't affect the result much, so I
>>>   didn't do much work in this phase.
>>> 
>>> (3) Transform Phase:
>>>   As there is no IR for interleaved data, I think we should use
>>>   intrinsics. The problem is that the relationship is "N to
>>>   one", i.e. several loads/stores map to one ldN/stN instruction.
>>>   There are already ldN/stN intrinsics in the AArch64/ARM backend,
>>>   such as llvm.aarch64.neon.ldN, which looks like "call { <4 x i32>,
>>>   <4 x i32> } llvm.aarch64.neon.ld2.v4i32()", while in the middle end
>>>   there are two IR loads.
>>> We need to think of a way to match the two loads to one target-specific
>>> intrinsic. I think there are two ways:
>>>   (a). Two steps, one in the middle end and one in the backend. The 1st
>>>   step is to match each load/store to one target-independent intrinsic
>>>   in the loop vectorizer. The 2nd step is to match several
>>>   intrinsics into one ldN/stN intrinsic. This is the choice of
>>>   my attached patch. For the above odd/even example, it will
>>>   generate two intrinsic calls in the loop vectorizer:
>>>              "%even-elements = call <4 x i32>
>>>              @llvm.interleave.load.v4i32",
>>>              "%odd-elements = call <4 x i32>
>>>              @llvm.interleave.load.v4i32".
>>> A backend pass will combine them together into one intrinsic:
>>>              "%even-odd-elements = call { <4 x i32>, <4 x i32> }
>>>              @llvm.aarch64.neon.ld2.v4i32"
>>> But I think the backend pass is fragile and difficult to
>>> implement. It will fail to combine if one load is missing, or if
>>> one load is moved to another basic block. Also, I haven't checked
>>> memory dependencies.
>>>   (b). One step, in the middle end only. We can match several
>>>   loads/stores into one ldN/stN-like target-independent
>>>   intrinsic, so that the AArch64/ARM backend only needs a slight
>>>   modification to replace the currently used intrinsic with the
>>>   new target-independent intrinsic. This requires introducing a new
>>>   intrinsic such as "{ <4 x i32>, <4 x i32> }
>>>   llvm.interleaved.load.v4i32()".
>>> 
>>>    Actually, I prefer solution (b), which is easier to implement
>>>    and stronger than solution (a). But solution (a)
>>>    seems more target independent. What do you guys think?
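To make solution (b) concrete, the vectorized loop body for the even/odd example
would then contain something like the following sketch (the exact operands of the
hypothetical llvm.interleaved.load intrinsic, e.g. the pointer argument, are an
assumption here, not something fixed by the patch):

  ; one call returns both de-interleaved vectors as a struct
  %lv   = call { <4 x i32>, <4 x i32> } @llvm.interleaved.load.v4i32(i32* %ptr)
  %even = extractvalue { <4 x i32>, <4 x i32> } %lv, 0
  %odd  = extractvalue { <4 x i32>, <4 x i32> } %lv, 1

On AArch64 this maps directly to ld2; on other targets it can be lowered to the wide
load plus shuffles sketched earlier in the thread.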
>>> 
>>> 2. Test
>>> I've tested the attached patches with the llvm-test-suite, SPEC2000,
>>> SPEC2006, EEMBC, and Geekbench on an AArch64 machine. They all pass,
>>> but the performance is not affected. Some specific benchmarks like
>>> EEMBC rgbcmy and EEMBC rgbyiq are expected to speed up by several
>>> times. The reason is that other issues prevent
>>> vectorization opportunities. Some known issues are:
>>>     (1). Too many unnecessary runtime checks (the interleaved
>>>     accesses are compared with each other).
>>>     (2). Store-load forwarding checks (they don't consider
>>>     strided accesses).
>>>     (3). A type promotion issue (i8 is illegal but <16 x i8> is
>>>     legal; i8 is promoted to i32, so the extend and truncate
>>>     operations increase the total cost).
>>>     (4). The vectorization factor is selected according to the widest
>>>     type (if there are both i8 and i32, we select a small
>>>     factor according to i32 rather than according to i8).
>>> Anyway, we can fix them in the future and get the performance
>>> improvements.
>>> 
>>> What are your opinions on the solution? I'm still hesitating about the
>>> transform phase.
>>> 
>>> Thanks,
>>> -Hao
>>> <LoopVectorize-InterleaveAccess.patch><AArch64Backend-MatchIntrinsics.patch>
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory



