[Libclc-dev] [PATCH] Fix vload3/vstore3 to emit only one IR load

Fri Sep 25 14:32:52 PDT 2015

> On 25 Sep 2015, at 22:20, Matt Arsenault <Matthew.Arsenault at amd.com> wrote:
> 
> On 09/25/2015 02:13 PM, Jeroen Ketema wrote:
>> Hi Matt,
>> 
>> The IR below seem fishy to me: if we have
>> 
>> vload3(get_global_id(0), A)
>> 
>> then  the work item with the highest id is likely to access an element out of bounds of the array being passed in.
>> 
>> Also, does the store generate a store of 4 elements, or will that be precisely be 3 elements?
>> 
>> Jeroen
> The store is also emitted as a <4 x i32>. I'm not sure why clang is avoiding direct load/store of 3 vectors, but this seems like a clang bug to me.

Right, then something like:

__kernel foo(__global int *A) {
  int3 tmp = vload3(get_global_id(0), A);
  tmp += 1;
  vstore3(tmp, get_global_id(0), A);
}

will have a data race, e.g., on the 4th element of A. And reading the spec, I think it should be data race free "vload3 and vload_half3 read x, y, z components from address (p + (offset * 3))”.

The reason why the load and store are turned in to 4 element loads and stores with your new code, is because something like:

int3 B[n];

will have an element of padding after each 3rd element. Hence, you’re just reading or writing an additional undefined value, which is not overlapping with one of the next element of the array (so no data race), and the reasoning is probably that a 4 element load or store is likely to be more efficient than a 3-element one. See also Section 6.1.5
"Alignment of Types" in the OpenCL 1.2 spec.

Jeroen