[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Sun Mar 22 20:10:58 PDT 2015

Hi Arnold,

See my comments below.

>>-----Original Message-----
>>From: Arnold Schwaighofer [mailto:aschwaighofer at apple.com]
>>Sent: March 20, 2015 23:57
>>To: Hao Liu
>>Cc: Hal Finkel; Nadav Rotem; Elena Demikhovsky; Commit Messages and
>>Patches for LLVM; Jiangning Liu; James Molloy; Adam Nemet
>>Subject: Re: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about
>>interleaved data accesses
>>
>>Hi Hao,
>>
>>thanks for working on this. I would love to see this happening. Previous
>>discussion on this happened on bug
>>https://llvm.org/bugs/show_bug.cgi?id=17677.
>>
>>I have not yet looked at your implementation in detail but your description
>>sounds similar to what is described in this bug.
>>
>>As you have observed in 2., you also need to teach dependency checking about
>>strided accesses.
>>
>>I would like to discuss the trade-off intrinsic vs emitting a vector load (store)
>>and the necessary shuffling: you have not mentioned this third possibility.
>>
>>Would it be hard to match such a combination in the backend?
>>
>>In your example we would have
>>
>>v1 = vload <4xi32>
>>v2 = vload <4xi32>
>>even_odd_vec = shuffle_vector v1, v2, <0,2,4,6, 1,3,5,7>
>>even = shuffle_vector even_odd_vec, undef, <0,1,2,3>
>>odd =  shuffle_vector even_odd_vec, undef, <4,5,6,7>
>>
>>I suspect that LLVM would take this apart to:
>>
>>v1 = vload <4xi32>
>>v2 = vload <4xi32>
>>even = shuffle_vector v1, v2, <0,2,4,6>
>>odd =  shuffle_vector v1, v2, <1,3,5,7>
>>
[Hao Liu]
Yeah, you are right when the interleave number is 2. We can use 4 IR
instructions to represent interleaved accesses of 2 vectors (that is 2 more
instructions than the 2-intrinsic representation, but it is acceptable).
But the IR representation becomes unacceptable when the interleave number is
3 or 4. E.g., if we want 3 interleaved vectors:
      V0: A[0], A[3], A[6], A[9]
      V1: A[1], A[4], A[7], A[10]
      V2: A[2], A[5], A[8], A[11]
We use 3 loads:
      %v0 = vload <4xi32>               // A[0...3]
      %v1 = vload <4xi32>               // A[4...7]
      %v2 = vload <4xi32>               // A[8...11]
      %tmp_v0 = shuffle_vector %v0, %v1, <0,3,6,undef>    // A[0], A[3], A[6], UNDEF
      %V0 = shuffle_vector %tmp_v0, %v2, <0,1,2,5>        // A[0], A[3], A[6], A[9]
      %tmp_v1 = shuffle_vector %v0, %v1, <1,4,7,undef>    // A[1], A[4], A[7], UNDEF
      %V1 = shuffle_vector %tmp_v1, %v2, <0,1,2,6>        // A[1], A[4], A[7], A[10]
      ...
That is too many IR instructions. Matching so many of them in the backend is
too difficult and fragile. If one of the instructions is moved or deleted, the
others could fail to be matched to ldN/stN.
For more details about ldN/stN, see:
http://community.arm.com/groups/processors/blog/2010/03/17/coding-for-neon--part-1-load-and-stores
ldN/stN may be useful for image processing or signal processing.
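
To make the instruction count concrete, here is a small Python model of the
3-way case above. It is only a sketch: shuffle_vector() is my own stand-in for
a two-input vector shuffle, element k of A simply holds the value k, and the
masks for V2 (elided above) are one possible choice of mine, not taken from
the patch.

```python
UNDEF = None  # stands in for an undef lane


def shuffle_vector(a, b, mask):
    """Model of a two-input shuffle: lane i of the result is element
    mask[i] of the concatenation a + b; an UNDEF mask entry stays UNDEF."""
    both = a + b
    return [UNDEF if m is UNDEF else both[m] for m in mask]


# A[k] == k, so each lane's value tells us which element of A it came from.
A = list(range(12))
v0, v1, v2 = A[0:4], A[4:8], A[8:12]  # three contiguous 4-lane loads

# Each result needs two shuffles, because one shuffle reads only two inputs.
tmp_v0 = shuffle_vector(v0, v1, [0, 3, 6, UNDEF])
V0 = shuffle_vector(tmp_v0, v2, [0, 1, 2, 5])

tmp_v1 = shuffle_vector(v0, v1, [1, 4, 7, UNDEF])
V1 = shuffle_vector(tmp_v1, v2, [0, 1, 2, 6])

# Assumed masks for the third vector (not spelled out in the mail).
tmp_v2 = shuffle_vector(v0, v1, [2, 5, UNDEF, UNDEF])
V2 = shuffle_vector(tmp_v2, v2, [0, 1, 4, 7])

print(V0, V1, V2)
```

In this model that is 3 loads plus 6 shuffles for what a single ld3 would
express, and a backend would have to recognize all 9 as one group.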

>>
>>Which is not what you want because then you have to match two instructions
>>
>>even = shuffle_vector v1, v2, <0,2,4,6>
>>odd =  shuffle_vector v1, v2, <1,3,5,7>
>>
>>which could fail.
[Hao Liu]
Yes, that's my point. Combining several IRs/intrinsics is fragile,
especially since the LoopVectorizer can also unroll the vectorized loop, which
interleaves the IRs/intrinsics of the unrolled iterations with each other.
E.g. Combining
      interleave.store(ptr, v0, 0)                     // V0
      interleave.store(unrollPtr, v2, 0)            // V2
      interleave.store(ptr, v1, 1)                     // V1
      interleave.store(unrollPtr, v3, 1)            // V3
to
      aarch64.st2(ptr, v0, v1)
      aarch64.st2(unrollPtr, v2, v3)
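
A rough Python model of this combine, to show what the backend has to prove.
The semantics of interleave_store() and st2() below are my assumptions for
illustration (lane i of the vector with index k goes to base + i*factor + k),
and memory is just a Python list; the real intrinsics may differ.

```python
def interleave_store(mem, base, vec, index, factor=2):
    """Hypothetical per-vector intrinsic: lane i of vec is written to
    mem[base + i * factor + index]."""
    for i, x in enumerate(vec):
        mem[base + i * factor + index] = x


def st2(mem, base, a, b):
    """Model of a two-vector interleaving store: a0, b0, a1, b1, ..."""
    for i, (x, y) in enumerate(zip(a, b)):
        mem[base + 2 * i] = x
        mem[base + 2 * i + 1] = y


v0, v1 = [0, 2, 4, 6], [1, 3, 5, 7]
v2, v3 = [8, 10, 12, 14], [9, 11, 13, 15]
ptr, unroll_ptr = 0, 8

# The four separate stores, in the interleaved order left by unrolling.
split = [None] * 16
interleave_store(split, ptr, v0, 0)
interleave_store(split, unroll_ptr, v2, 0)
interleave_store(split, ptr, v1, 1)
interleave_store(split, unroll_ptr, v3, 1)

# The two combined stores the backend would like to emit instead.
fused = [None] * 16
st2(fused, ptr, v0, v1)
st2(fused, unroll_ptr, v2, v3)

# If one of the four goes missing, half of one group is never written.
partial = [None] * 16
interleave_store(partial, ptr, v0, 0)
interleave_store(partial, unroll_ptr, v2, 0)
interleave_store(partial, unroll_ptr, v3, 1)
unwritten = partial.count(None)

print(split == fused, unwritten)
```

The two forms only agree when all four stores are present and each pair uses
the same base pointer, which is exactly what the combine has to establish.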

The fragile points are:
(1) One of the intrinsics may be moved to another basic block; how do we
combine them across basic blocks? One of the intrinsics/IRs may be deleted;
how do we combine the remaining ones?
(2) If we want to combine v0 with v1 and v2 with v3, we also need to check the
memory dependences between v0 and v2, v2 and v1, ....
(3) The pointers used in each intrinsic may be changed (AFAIK,
LoopStrengthReduce can change the pointer representations, so I put the
backend combine pass before the LSR pass), so we may also need to check that
different pointers point to the same address.
>>
>>To get around this you would really need an intrinsic that is not taken apart,
>>like your solution B. Solution A has the same problem as just using shuffles
>>(you are dependent on the pattern match succeeding, which might be foiled by
>>intermediate transformations) so I don’t think it would be a good option.
>>
>>{v1_ty_1, ... ,v1_ty_n} = interleave_load_n v1, .., vN (and interleaved_store)
>>
>>I don’t understand why Solution B is not target independent. You just need to
>>lower this intrinsic to regular vector loads and shuffles on targets that don’t
>>support it natively.
[Hao Liu]
I'm happy to see that you also think a single intrinsic is better than
separate intrinsics/IRs. I'll work out a new patch along these lines.

>>
>>Aside: I saw in your patch that you are asking whether a target supports it, i.e.
>>"isLegalInterleaveAccess(Type *DataTy)". This should not matter; rather, the
>>cost model should return appropriate costs for “interleaved_stores/loads”.
>>Even if a target does not support such an instruction natively it might still be
>>beneficial to vectorize with regular loads and shuffles.
[Hao Liu]
Oh, that is true if the interleave number is 2. Two vector loads and two
shuffles may still be beneficial and can enable loop vectorization.

>>
>>
>>Thanks,
>>Arnold

Thanks,
-Hao