[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses

Mon Mar 23 14:24:53 PDT 2015

On 23 March 2015 at 12:23, Demikhovsky, Elena
<elena.demikhovsky at intel.com> wrote:
> 1) It is not safe due to possible memory access after eof buffer

Though handling EOF with your intrinsic proposal means predicating
every access if the range is unknown, right? Can't this be handled in
the same way we handle EOF with the current vectorization factor, but
doubled/tripled/etc.?
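
To make that concrete, here is a rough scalar sketch of what I mean, not
code from the patch. The names sum_even, kVF and kStride are made up for
illustration, and the factor of 4 is an arbitrary assumption. The main
loop only runs over whole blocks of kVF * kStride elements, so it never
reads past the end of the buffer, and a scalar epilogue picks up the rest.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: sum the even-indexed elements of `data`.
// kVF and kStride are assumed values chosen for illustration only.
constexpr std::size_t kVF = 4;      // assumed vectorization factor
constexpr std::size_t kStride = 2;  // interleave factor (even/odd pairs)

double sum_even(const std::vector<double>& data) {
    const std::size_t n = data.size();
    // Each "vector" iteration consumes kVF * kStride consecutive elements.
    const std::size_t block = kVF * kStride;
    // Largest multiple of `block` that fits in the buffer.
    const std::size_t main_end = n - n % block;

    double lanes[kVF] = {0.0, 0.0, 0.0, 0.0};
    std::size_t i = 0;
    // Main loop: every index i + kStride * l is below main_end <= n,
    // so no access goes past the end of the buffer.
    for (; i < main_end; i += block)
        for (std::size_t l = 0; l < kVF; ++l)
            lanes[l] += data[i + kStride * l];

    double sum = 0.0;
    for (std::size_t l = 0; l < kVF; ++l)
        sum += lanes[l];

    // Scalar epilogue: remaining even-indexed elements, one at a time.
    // main_end is a multiple of kStride, so i stays on even indices.
    for (; i < n; i += kStride)
        sum += data[i];
    return sum;
}
```

The only change from the usual remainder handling is that the trip count
is rounded down to a multiple of kVF * kStride rather than kVF.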

> 2) I don't want to load odd elements if I need only even - nobody says that it should be implemented by sequential loads with shuffle

Exactly. Even if there is no other [x+1] operation, VLDN is still beneficial.

> 3) What happens if stride is 3 or 4?

I'm assuming ld2, ld3, ld4, etc. I agree this is a fragile design.

> To represent the interleaved load that you want to achieve with suggested intrinsic, you need 2 calls
> %even = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 2, i32 0, i32 align, <8 x i1> %mask, <8 x double> undef)
> %odd   = <8 x double> @llvm.interleave.load.v8f64(double * %ptr, i32 2, i32 1, i32 align, <8 x i1> %mask, <8 x double> undef)

I like this.
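
For reference, this is how I read the semantics of the two calls, as a
scalar model. It is my assumption, not something the proposal spells out:
lane i reads ptr[i * stride + offset] when its mask bit is set and keeps
the pass-through value otherwise. The function name interleave_load is
hypothetical and the alignment argument is left out.

```cpp
#include <array>
#include <cstddef>

// Scalar reference model of the *proposed* interleaved load.
// This is an assumed reading of the semantics, not an existing LLVM API.
template <std::size_t N>
std::array<double, N> interleave_load(const double* ptr,
                                      std::size_t stride,
                                      std::size_t offset,
                                      const std::array<bool, N>& mask,
                                      const std::array<double, N>& passthru) {
    // Start from the pass-through vector; masked-off lanes keep these values.
    std::array<double, N> result = passthru;
    for (std::size_t lane = 0; lane < N; ++lane) {
        // Only active lanes touch memory, so a false mask bit also
        // suppresses the (possibly out-of-bounds) access.
        if (mask[lane])
            result[lane] = ptr[lane * stride + offset];
    }
    return result;
}
```

With stride 2, offset 0 gives the even elements and offset 1 the odd ones,
which matches the %even / %odd pair above.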

> You can translate these 2 calls into one target specific on codegen pass, if the mask is "all true", of course.

If the first/last elements of the mask are false you can force a head/tail loop.

I'm also assuming that the vectorizer will only emit such intrinsics
if the target supports it. Otherwise, it'd be hard for the back-ends
that know nothing about interleaved access to untangle the mess and
still generate acceptable code.

cheers,
--renato