[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses
Hao.Liu at arm.com
Sun Mar 22 20:46:10 PDT 2015
See my comments below.
From: Hal Finkel [mailto:hfinkel at anl.gov]
Sent: March 21, 2015 0:31
To: Hao Liu
Cc: llvm-commits at cs.uiuc.edu; Jiangning Liu; James Molloy; aschwaighofer at apple.com; Nadav Rotem; Elena Demikhovsky
Subject: Re: [RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses
Perhaps this is a silly question, but why do you need interleaved load/store to support this? If we know that we need to access A[0], A[2], A[4], A[6], can't we generate two vector loads, one for A[0...3], and one for A[4...7], and then shuffle the results together? You need to leave the vector loop one iteration early (so you don't access off the end of the original access range), but that does not seem like a big loss. If I'm right, then I'd love to see this implemented in a way that can take advantage of interleaved load/store on targets that support them, but not require target support.
See my previous mail. If the interleave factor is 2, we can use 4 IR instructions (2 loads/stores and 2 shuffles), which seems beneficial. The problem is that if we emit separate IR instructions for targets that support interleaved accesses, we will have to spend much more effort combining them back together later, and that matching is fragile.
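To make the "2 loads, 2 shuffles" case concrete, here is a minimal scalar C++ sketch of de-interleaving a factor-2 access. It is only a model of the idea, not vectorizer code: Vec4, load4, shuffle and loadStride2 are hypothetical names made up for this illustration, and the shuffle mask convention is assumed to follow the IR shufflevector instruction (indices 0..3 select from the first operand, 4..7 from the second).

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4-lane vector, standing in for an IR <4 x i32> value.
using Vec4 = std::array<int, 4>;

// Plain contiguous "vector load" of 4 elements starting at p.
Vec4 load4(const int *p) { return {p[0], p[1], p[2], p[3]}; }

// Two-input shuffle: mask indices 0..3 pick from lo, 4..7 pick from hi.
Vec4 shuffle(const Vec4 &lo, const Vec4 &hi, const std::array<int, 4> &mask) {
  Vec4 out{};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = mask[i] < 4 ? lo[mask[i]] : hi[mask[i] - 4];
  return out;
}

struct Deinterleaved {
  Vec4 even; // A[0], A[2], A[4], A[6]
  Vec4 odd;  // A[1], A[3], A[5], A[7]
};

// De-interleave 8 consecutive elements using 2 loads and 2 shuffles.
Deinterleaved loadStride2(const int *a) {
  Vec4 lo = load4(a);     // A[0...3]
  Vec4 hi = load4(a + 4); // A[4...7]
  return {shuffle(lo, hi, {0, 2, 4, 6}), shuffle(lo, hi, {1, 3, 5, 7})};
}
```

On a target with ldN/stN, these four operations are what would have to be recognized and merged back into a single instruction.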
I think one solution can be as follows:
(1) For targets which don't support interleaved access, if it is beneficial, we can generate separate IR instructions.
(2) For the targets which support interleaved access, we can generate one intrinsic.
As for the interleaved access intrinsics, different targets may behave differently. For ARM or AArch64, one intrinsic can be matched directly to one ldN/stN instruction. For X86, I think one intrinsic can be matched to several indexed loads/stores in AVX (I'm not familiar with X86, but I think it can easily handle interleaved access intrinsics).
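The two-way choice in (1) and (2) could be sketched roughly as below. This is only an illustration under assumptions: TargetInfo, its fields, and chooseLowering are hypothetical names and do not correspond to any existing LLVM interface, and the cost comparison stands in for whatever "beneficial" would mean in the real cost model.

```cpp
// Possible ways to emit an interleaved access in the vectorized loop.
enum class Lowering { Scalar, LoadsAndShuffles, InterleavedIntrinsic };

// Hypothetical summary of what a target reports (not an LLVM API).
struct TargetInfo {
  bool hasInterleavedAccess;    // e.g. ldN/stN on ARM or AArch64
  unsigned shuffleSequenceCost; // assumed cost of wide loads/stores plus shuffles
  unsigned scalarCost;          // assumed cost of leaving the access scalar
};

Lowering chooseLowering(const TargetInfo &t) {
  // (2) Targets with interleaved accesses get a single intrinsic.
  if (t.hasInterleavedAccess)
    return Lowering::InterleavedIntrinsic;
  // (1) Otherwise emit separate loads/stores and shuffles, if beneficial.
  if (t.shuffleSequenceCost < t.scalarCost)
    return Lowering::LoadsAndShuffles;
  return Lowering::Scalar;
}
```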