[llvm-dev] [arm, aarch64] Alignment checking in interleaved access pass

Sat Oct 8 06:26:45 PDT 2016

On 19 September 2016 at 21:52, Alina Sbirlea via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> The question I'm trying to get answered if there should have been an
> alignment check for the original pass, and, similarly, if there should be an
> expanded one for the more general pattern.

Hi Alina,

IIRC, the initial implementation was a very simple and straightforward
way to make use of the VLDn instructions on ARM/AArch64 NEON.

Your patterns allow simple vector instructions in the trivial case,
but not in the cases where VLDn would make a difference.

The examples were:

for (i..N)
  out[i] = in[i] * Factor; // R
  out[i+1] = in[i+1] * Factor; // G
  out[i+2] = in[i+2] * Factor; // B

This pattern is easily vectorised on most platforms, since the loads, muls
and stores are each the exact same operation, which can be combined.

for (i..N)
  out[i] = in[i] * FactorR; // R
  out[i+1] = in[i+1] * FactorG; // G
  out[i+2] = in[i+2] * FactorB; // B

This can still be vectorised easily, since the Factor vector is simple
to construct.
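
As a rough sketch of what "constructing the Factor vector" means, here
is a plain C model. It assumes 4-lane vectors and 3 channels, so the
factor pattern repeats every 12 elements (three vectors). The names
build_factor_pattern and scale_rgb are made up for illustration, and
the inner loop only stands in for a single vector multiply:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape: 3 interleaved channels, 4 lanes per vector. Since 3 and 4
   are coprime, the factor pattern repeats every 3 * 4 = 12 elements. */
enum { CHANNELS = 3, LANES = 4, PERIOD = CHANNELS * LANES };

/* Fill pattern with FactorR, FactorG, FactorB, FactorR, ... */
static void build_factor_pattern(int fr, int fg, int fb, int pattern[PERIOD]) {
  const int f[CHANNELS] = {fr, fg, fb};
  for (size_t i = 0; i < PERIOD; i++)
    pattern[i] = f[i % CHANNELS];
}

/* Multiply n interleaved elements by the repeating pattern.
   n must be a multiple of PERIOD in this simplified model. */
static void scale_rgb(const int *in, int *out, size_t n,
                      const int pattern[PERIOD]) {
  assert(n % PERIOD == 0);
  for (size_t base = 0; base < n; base += PERIOD)
    for (size_t v = 0; v < PERIOD; v += LANES)
      /* This inner loop models one LANES-wide vector multiply. */
      for (size_t l = 0; l < LANES; l++)
        out[base + v + l] = in[base + v + l] * pattern[v + l];
}
```

No shuffling of the data is needed here: the loads and stores stay
contiguous, and only the constant operand changes from one vector to
the next.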

for (i..N)
  out[i] = in[i] + FactorR; // R
  out[i+1] = in[i+1] - FactorG; // G
  out[i+2] = in[i+2] * FactorB; // B

Now it gets complicated, because the operations are not the same. In
this case, VLDn helps, because you shuffle [0, 1, 2, 3, 4, 5] -> VADD
[0, 3] + VSUB [1, 4] + VMUL [2, 5].
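
To make that shuffle concrete, here is a minimal scalar model in plain
C of what a VLD3-style de-interleaving load and a VST3-style store do.
The helper names (deinterleave3, interleave3, process_rgb) are invented
for this sketch, and it uses small arrays instead of NEON registers, so
it only models the data movement, not the actual instructions:

```c
#include <assert.h>
#include <stddef.h>

/* Models a VLD3-style load: split interleaved [R0 G0 B0 R1 G1 B1 ...]
   into one array per channel. n is the number of pixels. */
static void deinterleave3(const int *in, size_t n, int *r, int *g, int *b) {
  for (size_t i = 0; i < n; i++) {
    r[i] = in[3 * i];
    g[i] = in[3 * i + 1];
    b[i] = in[3 * i + 2];
  }
}

/* Models a VST3-style store: the inverse of deinterleave3. */
static void interleave3(const int *r, const int *g, const int *b, size_t n,
                        int *out) {
  for (size_t i = 0; i < n; i++) {
    out[3 * i] = r[i];
    out[3 * i + 1] = g[i];
    out[3 * i + 2] = b[i];
  }
}

/* Hypothetical kernel: add to R, subtract from G, multiply B,
   two pixels (six elements) at a time. */
static void process_rgb(const int *in, int *out, size_t n_pixels,
                        int fr, int fg, int fb) {
  enum { VF = 2 }; /* assumed vectorisation factor for this sketch */
  int r[VF], g[VF], b[VF];
  assert(n_pixels % VF == 0);
  for (size_t p = 0; p < n_pixels; p += VF) {
    deinterleave3(in + 3 * p, VF, r, g, b);
    for (size_t i = 0; i < VF; i++) {
      r[i] += fr; /* VADD on elements [0, 3] */
      g[i] -= fg; /* VSUB on elements [1, 4] */
      b[i] *= fb; /* VMUL on elements [2, 5] */
    }
    interleave3(r, g, b, VF, out + 3 * p);
  }
}
```

Once the channels are separated, each operation runs on a homogeneous
vector, which is exactly what the mixed add/sub/mul loop needs.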

Your case seems to be more like:

for (i..N)
  out[i] = in[i] * FactorR; // R
  out[i+4] = in[i+4] * FactorG; // G
  out[i+8] = in[i+8] * FactorB; // B

In that case VLDn won't help, but re-shuffling the vectors as in the
second case above will.

Even this case:

for (i..N)
  out[i] = in[i] + FactorR; // R
  out[i+4] = in[i+4] - FactorG; // G
  out[i+8] = in[i+8] * FactorB; // B

can work, if the ranges are not overlapping. So, [0, 4, 8] would work
on a 4-way vector, but [0, 2, 4] would only work on a 2-way vector.
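
The overlap rule can be sketched as a small check. This is only an
illustration, assuming each access touches a contiguous run of elements
starting at its offset and that the offsets are sorted and distinct.
The function name max_nonoverlapping_width is made up and is not
something in the pass:

```c
#include <assert.h>
#include <stddef.h>

/* Largest width such that the ranges [offsets[k], offsets[k] + width)
   do not overlap. Assumes n >= 2 and strictly ascending offsets. */
static size_t max_nonoverlapping_width(const size_t *offsets, size_t n) {
  assert(n >= 2);
  size_t width = (size_t)-1; /* start from the largest possible value */
  for (size_t k = 1; k < n; k++) {
    assert(offsets[k] > offsets[k - 1]);
    size_t gap = offsets[k] - offsets[k - 1];
    if (gap < width)
      width = gap; /* the smallest gap bounds the vector width */
  }
  return width;
}
```

With offsets [0, 4, 8] this returns 4, and with [0, 2, 4] it returns 2,
matching the 4-way and 2-way limits above.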

> In the example above, I was looking to check if the data at positions 4, 16,
> 32 is aligned, but I cannot get a clear picture on the impact on performance

On modern ARM and AArch64, misaligned loads are not a problem. This is
true at least from A15 onwards, possibly A9 (James may know better).

If your ranges overlap, you may be forced to reduce the vectorisation
factor, thus reducing performance, but the vectoriser should be able
to pick that up from the cost analysis pass (2-way vs 4-way).

> Also, some preliminary alignment checks I added break some ARM tests (and
> not their AArch64 counterparts). The cause is getting "not fast" from
> allowsMisalignedMemoryAccesses, from checking hasV7Ops.

What do you mean by "break"? Bad codegen? Slower code?

> Side question for Tim and other ARM folks, could I get a recommendation on
> reading material for performance tuning for the different ARM archs?

ARM has a list of manuals on each core, including optimisation guides:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.cortexa/index.html

cheers,
--renato