Vectorization of pointer PHI nodes

Arnold Schwaighofer aschwaighofer at apple.com
Mon Oct 14 11:31:55 PDT 2013


Renato, can you post the C code for the function and the assembly that GCC produces?

Your initial example could be handled well by vectorization of strided loops (and the mention of VLD3(.8?)/VST3(.8?) led me to assume that this is what happened). But the LLVM IR you sent has a store of 0 in there ;) and strides by 4.


Thanks,
Arnold


Vectorization of strided loops:

I am using float as the example; otherwise this would get too long.

void f(float * restrict read, float * restrict write) {
  for (int i = 0; i < 256; i++) {
    float a1 = *read++ * 3.0;
    float a2 = *read++ * 4.0;
    float a3 = *read++ * 5.0;

    *write++ = a1;
    *write++ = a2;
    *write++ = a3;
  }
}


recognized as

  for (int i = 0; i < 3 * 256; i += 3) {
    float a1 = read[i] * 3.0;
    float a2 = read[i+1] * 4.0;
    float a3 = read[i+2] * 5.0;

    write[i] = a1;
    write[i+1] = a2;
    write[i+2] = a3;
  }
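
To make the "recognized as" step concrete, here is a minimal sketch in C that puts the pointer-increment form and the indexed form side by side and checks that they write the same values. It assumes the loop covers 256 groups of three floats (768 elements) and uses float constants for brevity; the names f_pointer, f_indexed, check_forms_agree, GROUPS and ELEMS are made up for this illustration.

```c
#include <assert.h>

enum { GROUPS = 256, ELEMS = 3 * GROUPS };

/* Original form: both pointers advance three times per iteration. */
void f_pointer(const float *restrict read, float *restrict write) {
  for (int i = 0; i < GROUPS; i++) {
    float a1 = *read++ * 3.0f;
    float a2 = *read++ * 4.0f;
    float a3 = *read++ * 5.0f;

    *write++ = a1;
    *write++ = a2;
    *write++ = a3;
  }
}

/* Indexed form: a single induction variable with stride 3. */
void f_indexed(const float *restrict read, float *restrict write) {
  for (int i = 0; i < ELEMS; i += 3) {
    write[i]     = read[i]     * 3.0f;
    write[i + 1] = read[i + 1] * 4.0f;
    write[i + 2] = read[i + 2] * 5.0f;
  }
}

/* Runs both forms on the same input and asserts identical output. */
void check_forms_agree(void) {
  static float in[ELEMS], out_pointer[ELEMS], out_indexed[ELEMS];
  for (int i = 0; i < ELEMS; i++)
    in[i] = (float)i * 0.5f;
  f_pointer(in, out_pointer);
  f_indexed(in, out_indexed);
  for (int i = 0; i < ELEMS; i++)
    assert(out_pointer[i] == out_indexed[i]);
}
```

The indexed form is what exposes the stride-3 access pattern: one induction variable plus three fixed offsets, instead of two pointers that each move three times per iteration.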

=> Loop-vectorize with a factor of 4, recognizing that after we vector-unroll the loop by four, the scattered accesses from the different lines (read[i]..read[i+9+2]) are consecutive, so we can vectorize these accesses efficiently (3 vector loads plus interleaves, which on ARM we can do with VLD3):

  for (int i = 0; i < 3 * 256; i += 12) {
    float a1 = read[i] * 3.0;
    float a1_2 = read[i+3] * 3.0;
    float a1_3 = read[i+6] * 3.0;
    float a1_4 = read[i+9] * 3.0;

    float a2 = read[i+1] * 4.0;
    float a2_2 = read[i+3+1] * 4.0;
    …

    float a3 = read[i+2] * 5.0;
    float a3_2 = read[i+3+2] * 5.0;

    write[i] = a1;
    write[i+3] = a1_2;
    …

    write[i+1] = a2;
    write[i+1+3] = a2_2;
    ...
  }


 VLD3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [read+i]
 a1..a1_4 = VMUL a1..a1_4, #3.0
 a2..a2_4 = VMUL a2..a2_4, #4.0
 a3..a3_4 = VMUL a3..a3_4, #5.0
 VST3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [write+i]
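
The VLD3/VMUL/VST3 sequence can be mimicked in portable C, which may help when checking the lane layout. This is only a sketch: load3 and store3 are hypothetical helpers that emulate the de-interleaving load and the interleaving store with scalar loops, f_vectorized and LANES are names invented here, and the element count is assumed to be a multiple of 12 so that no scalar remainder loop is needed.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 };

/* Emulates a 3-way de-interleaving load: 12 consecutive floats are split
   into three 4-lane "registers" (elements 0,3,6,9 / 1,4,7,10 / 2,5,8,11). */
static void load3(const float *p, float a[LANES], float b[LANES],
                  float c[LANES]) {
  for (int l = 0; l < LANES; l++) {
    a[l] = p[3 * l];
    b[l] = p[3 * l + 1];
    c[l] = p[3 * l + 2];
  }
}

/* Emulates the matching 3-way interleaving store. */
static void store3(float *p, const float a[LANES], const float b[LANES],
                   const float c[LANES]) {
  for (int l = 0; l < LANES; l++) {
    p[3 * l]     = a[l];
    p[3 * l + 1] = b[l];
    p[3 * l + 2] = c[l];
  }
}

/* Vectorized shape of the loop: one load3, three lane-wise multiplies,
   one store3 per 12 elements. */
void f_vectorized(const float *restrict read, float *restrict write,
                  size_t n) {
  assert(n % (3 * LANES) == 0); /* sketch only: no remainder handling */
  for (size_t i = 0; i < n; i += 3 * LANES) {
    float a1[LANES], a2[LANES], a3[LANES];
    load3(read + i, a1, a2, a3);
    for (int l = 0; l < LANES; l++) {
      a1[l] *= 3.0f;
      a2[l] *= 4.0f;
      a3[l] *= 5.0f;
    }
    store3(write + i, a1, a2, a3);
  }
}
```

Each of the three inner arrays stands in for one vector register, so the multiply loop corresponds to the three VMULs, and the store goes to the write pointer rather than the read pointer.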



On Oct 14, 2013, at 12:15 PM, Nadav Rotem <nrotem at apple.com> wrote:

> This is almost ideal for SLP vectorization, except for two problems:
> 
> 1. We have 4 stores to consecutive locations, but the last element is the constant zero, and not an additional SUB. At the moment we don’t have support for idempotence operations, but this is something that we should add.
> 
> 2. The values that we are subtracting come from 3 loads. We usually load 4 elements from memory, or scalarize the inputs (we don’t support masked loads on AVX512).
> 
> Do you know if the GCC SLP Vectorizer vectorizes this, or is it their Loop Vectorizer?
> 
> Thanks,
> Nadav 
>   
> 
> 
> On Oct 14, 2013, at 10:09 AM, Renato Golin <renato.golin at linaro.org> wrote:
> 
>> On 14 October 2013 18:03, Nadav Rotem <nrotem at apple.com> wrote:
>> This also looks like a form of SLP vectorization.
>> 
>> Yes. Would it be more beneficial to make it a BB-only pass? It seems that, independent of that, it would be beneficial to have pointer reduction variables.
>> 
>> 
>> I assume that you meant to write (*read++). Basically, we have a wide load and a wide store and some operations on ABC.
>> 
>> yes.
>> 
>> 
>> Can you send the IR for this code?
>> 
>> Unoptimized and optimized version, with the latter being exactly what the vectorizer will see at O3 (I dumped from inside the debugger and it was identical).
>> 
>> cheers,
>> --renato
>> 
>> 
>> <vect-pointer-test.zip>
> 




