[llvm-dev] [arm, aarch64] Alignment checking in interleaved access pass

Renato Golin via llvm-dev llvm-dev at lists.llvm.org
Mon Oct 10 13:14:12 PDT 2016


On 10 October 2016 at 19:39, Alina Sbirlea <alina.sbirlea at gmail.com> wrote:
> Now, for ARM archs Halide is currently generating explicit VSTn intrinsics,
> with some of the patterns I described, and I found no reason why Halide
> shouldn't generate a single shuffle, followed by a generic vector store and
> rely on the interleaved access pass to generate the right intrinsic.

IIRC, the shuffle that unscrambles the interleaved pattern <0, 4, 8, 1
...> -> <0, 1, 2 ...> is how the strided algorithm works, so that the
back-end can match the pattern and emit a VSTn / STn because the
access is "sequential" in an n-hop way.
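
A rough sketch of the mask shape being matched, written in Python for
brevity rather than taken from the pass itself. The names
`interleave_mask` and `match_interleave` are hypothetical, and the
sketch assumes a fully defined mask with factors 2 to 4 only; the real
matcher deals with more cases (undef lanes, for instance).

```python
def interleave_mask(factor, lanes):
    """Mask that interleaves `factor` vectors of `lanes` elements each.

    Element `lane` of vector `field` sits at index field * lanes + lane
    in the concatenated input, and lands at lane * factor + field in
    the interleaved output.
    """
    return [field * lanes + lane
            for lane in range(lanes)
            for field in range(factor)]


def match_interleave(mask):
    """Return (factor, lanes) if `mask` is a full interleave mask, else None."""
    for factor in (2, 3, 4):  # VST2/VST3/VST4 or ST2/ST3/ST4
        if len(mask) % factor:
            continue
        lanes = len(mask) // factor
        if lanes >= 2 and mask == interleave_mask(factor, lanes):
            return factor, lanes
    return None


print(interleave_mask(3, 4))
print(match_interleave([0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]))
print(match_interleave(list(range(12))))
```

The first mask starts <0, 4, 8, 1 ...>, as above; a plain sequential
mask is not recognised as interleaved.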

If the shuffle doesn't present itself in that way, the back-end
doesn't match and you end up with a long list of VMOVs.

Also, the vectorisers today make sure that the sequence is continuous
and contiguous, which doesn't seem to be a hard requirement for
Halide. I don't think there's a better reason than "we haven't thought
about the general case yet".

One way to test the back-end pattern matching is to emit textual IR
and manually change it, removing the intrinsics or changing the
shuffles, and then see what happens after `opt`.
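
As a starting point for that kind of experiment, here is a small
sketch that prints textual IR for a factor-3 interleaved store with no
intrinsics in it. The helper names (`mask_text`, `store_factor3_ir`)
and the function name `@store_factor3` are made up for illustration,
and the IR uses the typed-pointer syntax of LLVM releases from around
this time; newer releases would spell the pointer type as `ptr`.

```python
def mask_text(indices):
    """Format shuffle mask elements as LLVM IR constants."""
    return ", ".join(f"i32 {i}" for i in indices)


def store_factor3_ir(lanes=4):
    """Textual IR: concatenate three vectors, interleave them, store."""
    wide, total = 2 * lanes, 3 * lanes
    vec = f"<{lanes} x i32>"
    # Pad the third vector to double width so it can be concatenated.
    pad = ", ".join([f"i32 {i}" for i in range(lanes)] + ["i32 undef"] * lanes)
    # Interleave: lane 0 of each input, then lane 1 of each input, ...
    inter = mask_text(f * lanes + l for l in range(lanes) for f in range(3))
    lines = [
        f"define void @store_factor3(<{total} x i32>* %ptr, "
        f"{vec} %v0, {vec} %v1, {vec} %v2) {{",
        f"  %s0 = shufflevector {vec} %v0, {vec} %v1, "
        f"<{wide} x i32> <{mask_text(range(wide))}>",
        f"  %s1 = shufflevector {vec} %v2, {vec} undef, "
        f"<{wide} x i32> <{pad}>",
        f"  %interleaved = shufflevector <{wide} x i32> %s0, "
        f"<{wide} x i32> %s1, <{total} x i32> <{inter}>",
        f"  store <{total} x i32> %interleaved, <{total} x i32>* %ptr, align 4",
        "  ret void",
        "}",
    ]
    return "\n".join(lines)


print(store_factor3_ir())
```

Editing the last shuffle mask by hand (swapping two indices, say) and
running the result through the tools again should show whether the
back-end still matches the pattern or falls back to a list of moves.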


> Performance-wise, it is worth using the VSTns in the scenarios they
> encounter, it's mostly a question of where they get generated.

I'm confused. AFAIK, VSTn is AArch32 while STn is AArch64, and for the
one-lane variant, the latency and throughput seem to be identical.


> If the alignment is not an issue, it simplifies things.

Except in rare cases (which are not pertinent to the case at hand),
ARMv8 handles unaligned loads and stores without penalties.

cheers,
--renato

