[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Fri Mar 20 09:43:51 PDT 2015

> On Mar 20, 2015, at 9:30 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> 
> From: "Hao Liu" <Hao.Liu at arm.com>
> To: aschwaighofer at apple.com, hfinkel at anl.gov, "Nadav Rotem" <nrotem at apple.com>, "Elena Demikhovsky" <elena.demikhovsky at intel.com>
> Cc: llvm-commits at cs.uiuc.edu, "Jiangning Liu" <Jiangning.Liu at arm.com>, "James Molloy" <James.Molloy at arm.com>
> Sent: Friday, March 20, 2015 6:47:52 AM
> Subject: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses
> 
> Hi,
>  
> The two attached patches achieve this goal:
>      LoopVectorize-InterleaveAccess.patch teaches the Loop Vectorizer about interleaved data accesses and generates target-independent intrinsics for each load/store.
>      AArch64Backend-MatchIntrinsics.patch matches several target-independent intrinsics into one AArch64 ldN/stN intrinsic, so that the AArch64 backend can generate ldN/stN instructions.
>  
> Currently, LoopVectorize can vectorize consecutive accesses well. It can vectorize loops like
>     for (int i = 0; i < n; i++)
>          sum += R[i];
>  
> But it doesn't handle strided accesses well. Interleaved access is a subset of strided access. An example of interleaved access:
>     for (int i = 0; i < n; i++) {
>          int even = A[2*i];
>          int odd = A[2*i + 1];
>          // do something with odd & even.
>     }
> To vectorize such a case, we need two vectors: one with the even elements, another with the odd elements. To gather the even elements, we need several scalar loads for "A[0], A[2], A[4], ...", and several INSERT_ELEMENTs to combine them. The cost is very high and will usually prevent loop vectorization in such cases.
> 
> Perhaps this is a silly question, but why do you need interleaved load/store to support this? If we know that we need to access A[0], A[2], A[4], A[6], can't we generate two vector loads, one for A[0...3] and one for A[4...7], and then shuffle the results together? You need to leave the vector loop one iteration early (so you don't access off the end of the original access range), but that does not seem like a big loss. If I'm right, then I'd love to see this implemented in a way that can take advantage of interleaved load/store on targets that support them, but does not require target support.
> 

You don’t need interleaved loads/stores. You can represent this as vector memory operations and shuffles. The disadvantage of this representation is that the backend has to reconstruct the interleaved instruction from a series of instructions, which people argue can be brittle.
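
To make the "vector memory operations and shuffles" form concrete, here is a rough scalar C model of it for the even/odd example, with a vectorization factor of 4. This is only a sketch: vec4, load_vec, shuffle and sum_even_odd are made-up names for illustration and are not taken from either patch, and the "do something with odd & even" step is assumed to be a simple sum.

```c
#include <stddef.h>

#define VF 4 /* assumed vectorization factor, for illustration only */

/* Stand-in for a 4 x i32 vector register (hypothetical type). */
typedef struct { int lane[VF]; } vec4;

/* Models one contiguous vector load of VF elements starting at p. */
static vec4 load_vec(const int *p) {
    vec4 v;
    for (int k = 0; k < VF; k++)
        v.lane[k] = p[k];
    return v;
}

/* Models a two-input shuffle: mask value m picks lo.lane[m] when m < VF,
   otherwise hi.lane[m - VF]. */
static vec4 shuffle(vec4 lo, vec4 hi, const int mask[VF]) {
    vec4 r;
    for (int k = 0; k < VF; k++) {
        int m = mask[k];
        r.lane[k] = (m < VF) ? lo.lane[m] : hi.lane[m - VF];
    }
    return r;
}

/* Sums A[2*i] and A[2*i + 1] separately for i in [0, n).
   A must hold at least 2*n elements. */
static void sum_even_odd(const int *A, size_t n, long *even_sum, long *odd_sum) {
    static const int even_mask[VF] = {0, 2, 4, 6};
    static const int odd_mask[VF]  = {1, 3, 5, 7};
    long e = 0, o = 0;
    size_t i = 0;

    /* "Vector" body: two wide loads cover A[2*i .. 2*i + 7], then two
       shuffles split them into an even vector and an odd vector. */
    for (; i + VF <= n; i += VF) {
        vec4 lo   = load_vec(&A[2 * i]);
        vec4 hi   = load_vec(&A[2 * i + VF]);
        vec4 even = shuffle(lo, hi, even_mask);
        vec4 odd  = shuffle(lo, hi, odd_mask);
        for (int k = 0; k < VF; k++) {
            e += even.lane[k];
            o += odd.lane[k];
        }
    }

    /* Scalar tail for the remaining iterations. */
    for (; i < n; i++) {
        e += A[2 * i];
        o += A[2 * i + 1];
    }

    *even_sum = e;
    *odd_sum = o;
}
```

Because both the even and the odd elements are used here, the two wide loads touch exactly the elements the scalar loop would touch. A backend with ldN/stN would have to recognize the two loads plus the {0,2,4,6} and {1,3,5,7} shuffles as one interleaved load, which is the pattern matching referred to above.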

As you note, neither representation should prevent targets that don’t natively support interleaved loads/stores from simulating them with loads/stores and shuffles. The question is which representation we want to choose (see also my other email): do we want an intrinsic that we lower to vector mem ops and shuffles on platforms that don’t support it, or do we represent this as vector mem ops and shuffles and hope the backend can reconstruct the interleaved access (on platforms that support it)?